Self-regulation and related psychological constructs such as impulsivity, self-control and inhibition correlate with problematic real-world behaviors such as disordered eating (Mobbs et al., 2010; Verbeken et al., 2009; Nederkoorn et al., 2006a; Nederkoorn et al., 2006b), gambling (Lawrence et al., 2009; Fuentes et al., 2006; Alessi and Petry, 2003), drug addiction (Coffey et al., 2003; de Wit, Crean and John, 2000; Sher, Bartholow and Wood, 2000; Kirby, Petry and Bickel, 1999) and poor financial decisions (Meier and Sprenger, 2015; Meier and Sprenger, 2013; Meier and Sprenger, 2012). In these lines of work, measures of self-regulation are used as behavioral assays for individual difference analyses. An underlying assumption of treating behavioral measures as reflective of person-specific characteristics is that the measures of self-regulation are stable across time; in other words, that they have high test-retest reliability (high between- and low within-subject variability). This assumption has not been tested extensively and equally for all measures of self-regulation in the literature.
Self-regulation measures fall into two broad categories: surveys and cognitive tasks. While the former typically undergo some form of psychometric testing, often filtering out items with low test-retest reliability, this is less frequently true for the latter. Although empirical test-retest reliabilities for certain cognitive task measures are reported in pockets of the literature, an exhaustive summary, or a systematic elimination of tasks or measures based on stability across time, does not seem to exist. Test-retest reliability is, however, crucial for individual difference analyses (Hedge, Powell and Sumner, 2017).
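Hedge et al.'s point can be made concrete with a small simulation (illustrative values, not from this study's data): when two measures each have modest test-retest reliability, the correlation observed between them is attenuated toward zero relative to the true relation between the underlying constructs.

```r
# Illustrative simulation (not project data): measurement error attenuates
# the observed correlation roughly by sqrt(rel_x * rel_y).
set.seed(42)
n <- 1e5
true_score <- rnorm(n)               # shared latent construct
rel_x <- 0.4; rel_y <- 0.4           # test-retest reliabilities of the two measures
x <- sqrt(rel_x) * true_score + sqrt(1 - rel_x) * rnorm(n)
y <- sqrt(rel_y) * true_score + sqrt(1 - rel_y) * rnorm(n)
cor(x, y)                            # ~0.4, despite a true correlation of 1
sqrt(rel_x * rel_y)                  # classical attenuation bound
```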
To answer the question of the reliability of self-regulation measures comprehensively, we created a large battery consisting of both cognitive task and survey measures. In this paper we make two contributions. First, we provide a review of these measures along with their reported reliabilities, where available. Second, we report the results of a large study in which we collected retest reliability data using all of these measures. This effort not only fills a notable gap in the literature by clarifying which measures of self-regulation are stable across time, but also provides guidance on factors to consider when constructing new tasks for individual difference measures.
This sample consists of data for 150 subjects from the original sample of 522 who completed the initial battery of 37 cognitive tasks, 23 surveys and 3 demographic surveys. Details of the original sample as well as quality control (QC) procedures are described elsewhere (Eisenberg et al., 2017). Invited participants were chosen randomly, and only subsets of them were invited for a given batch (instead of opening the battery to all qualified subjects), with the intention of avoiding potential oversampling of, and bias towards, "high self-regulators".
workers = read.csv(paste0(retest_data_path,'Local/User_717570_workers.csv'))
workers = workers %>%
group_by(Worker.ID) %>%
mutate(Retest_worker=ifelse(sum(CURRENT.RetestWorker,CURRENT.RetestWorkerB2,CURRENT.RetestWorkerB3,CURRENT.RetestWorkerB4,CURRENT.RetestWorkerB5,na.rm=T)>0,1,0)) %>%
ungroup()
worker_counts <- fromJSON(paste0(retest_data_path,'/Local/retest_worker_counts.json'))
worker_counts = as.data.frame(unlist(worker_counts))
names(worker_counts) = "task_count"
In total, 242 participants were invited, 175 began the battery, 157 completed it and 150 provided data that passed QC at both time points. Our target sample size was determined in advance of data collection, and data collection continued until this number of participants with data surviving QC was reached.
disc_comp_date = read.csv(paste0(retest_data_path,'Local/discovery_completion_dates.csv'), header=FALSE)
val_comp_date = read.csv(paste0(retest_data_path,'Local/validation_completion_dates.csv'), header=FALSE)
test_comp_date = rbind(disc_comp_date, val_comp_date)
rm(disc_comp_date, val_comp_date)
retest_comp_date = read.csv(paste0(retest_data_path,'Local/retest_completion_dates.csv'), header=FALSE)
comp_dates = merge(retest_comp_date, test_comp_date, by="V1")
names(comp_dates) <- c("sub_id", "retest_comp", "test_comp")
comp_dates$retest_comp = as.Date(comp_dates$retest_comp)
comp_dates$test_comp = as.Date(comp_dates$test_comp)
comp_dates$days_btw = with(comp_dates, retest_comp-test_comp)
Retest data collection took place on average 115 days after the completion of the initial battery, with a range of 60 to 228 days.
rm(test_comp_date, retest_comp_date, comp_dates)
test_demog <- read.csv(paste0(retest_data_path, '/t1_data/demographic_health.csv'))
retest_demog <- read.csv(paste0(retest_data_path, 'demographic_health.csv'))
retest_demog = retest_demog[retest_demog$X %in% test_demog$X,]
names(test_demog)[which(names(test_demog) == 'X')] <-'sub_id'
names(retest_demog)[which(names(retest_demog) == 'X')] <-'sub_id'
summary(test_demog %>%
select(Sex, Age))
Sex Age
Min. :0.00 Min. :21.0
1st Qu.:0.00 1st Qu.:28.0
Median :1.00 Median :33.0
Mean :0.52 Mean :34.1
3rd Qu.:1.00 3rd Qu.:38.8
Max. :1.00 Max. :59.0
summary(retest_demog %>%
select(Sex, Age))
Sex Age
Min. :0.000 Min. :21.0
1st Qu.:0.000 1st Qu.:29.0
Median :1.000 Median :33.0
Mean :0.527 Mean :34.5
3rd Qu.:1.000 3rd Qu.:38.8
Max. :1.000 Max. :60.0
One of the major contributions of this project is a comprehensive literature review of the retest reliabilities of the surveys and tasks that were used. We reviewed the literature at the measure level (as opposed to the task level), paying attention to differences in sample size, the delay between the two measurements, and the statistic used to assess reliability (e.g. Spearman vs. Pearson correlations). Here we present a table and a visualization summarizing our findings. References mentioned in the table below can be found here.
lit_review <- read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/lit_review_figure.csv')
lit_review
lit_review = lit_review %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)]),
type = as.character(type)) %>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
arrange(task_group, raw_fit, var) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
select(-measure_description)
Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
else paste0(labels, : duplicated levels in factors are deprecated
tmp = lit_review[duplicated(lit_review$reference)==FALSE,]
nrow(tmp)
[1] 154
sum(tmp$sample_size)
[1] 17550
nrow(lit_review)
[1] 583
rm(tmp)
Measure level plot
lit_review = lit_review %>%
select(-reference)
Because this plot is difficult to digest, we summarize it at the task level to give a general sense of the main takeaways. The task-level plot naturally disregards much of the fine-grained information.
p1_t_legend <- lit_review %>%
filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='black')+
theme(axis.text.y = element_text(size=43),
legend.position = 'bottom',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30),
legend.text = element_text(size=20),
legend.key.width = unit(0.75, "inches"),
legend.title = element_text(size=28),
legend.spacing.x = unit(0.5, "inches")) +
guides(size = guide_legend(override.aes = list(size=c(9,18,28))),
shape = guide_legend(override.aes = list(size=18)))+
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3), name="Type")+
scale_size_continuous(name = "Sample Size")+
geom_vline(xintercept = 0, color = "red", size = 1)
p1_t <- lit_review %>%
filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='#00BFC4')+
theme(axis.text.y = element_text(size=43),
legend.position = 'none',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30)) +
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
scale_size_continuous(range = c(5, 35))+
geom_vline(xintercept = 0, color = "red", size = 1)
p2_t <- lit_review %>%
filter(task == 'survey') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='#F8766D')+
theme(axis.text.y = element_text(size=43),
legend.position = 'none',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30)) +
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
scale_size_continuous(range = c(5, 35))+
geom_vline(xintercept = 0, color = "red", size = 1)
mylegend<-g_legend(p1_t_legend)
p3_t <- arrangeGrob(arrangeGrob(p1_t, p2_t, nrow=1), mylegend, nrow=2,heights=c(10, 1))
ggsave('Lit_Review_Plot_t.jpg', plot = p3_t, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 72)
rm(p1_t, p2_t, p3_t, mylegend, p1_t_legend)
An interactive version of this plot can be found here
Takeaways from this review are:
- Survey measures have been reported to have higher reliability than task measures
- Survey measures show less variability in reported reliability estimates than task measures
The variables included in this report are:
- meaningful variables (includes only HDDM parameters)
- EZ diffusion parameters
- Raw RT and Accuracy measures
- Variables found in the literature (for comparison)
test_data <- read.csv(paste0(retest_data_path,'t1_data/variables_exhaustive.csv'))
test_data <- test_data[,names(test_data) %in% retest_report_vars]
test_data$X <- as.character(test_data$X)
names(test_data)[which(names(test_data) == 'X')] <-'sub_id'
For reference, here are the variables that are not included in the analyses in the remainder of this report because they have not been of theoretical interest in factor structure analyses of these data so far. These include drift diffusion and other model parameters for specific conditions within a task; survey variables that are not part of the dependent variables for that survey in the literature; and demographics (which are saved for prediction analyses).
test_data2 <- read.csv(paste0(retest_data_path, 't1_data/variables_exhaustive.csv'))
df <- data.frame(names(test_data2)[which(names(test_data2) %in% names(test_data) == FALSE)])
names(df) = c('vars')
df
retest_data <- read.csv(paste0(retest_data_path,'variables_exhaustive.csv'))
retest_data <- retest_data[,names(retest_data) %in% retest_report_vars]
retest_data$X <- as.character(retest_data$X)
names(retest_data)[which(names(retest_data) == 'X')] <-'sub_id'
retest_data = retest_data[retest_data$sub_id %in% test_data$sub_id,]
Since HDDM parameters depend on the sample on which they are fit, we refit the model on t1 data for the subjects who have t2 data. Here we replace the HDDM parameters in the current t1 dataset with these refitted values.
hddm_refits <- read.csv(paste0(retest_data_path,'t1_data/hddm_refits_exhaustive.csv'))
hddm_refits = hddm_refits[,names(hddm_refits) %in% retest_report_vars]
hddm_refits$X <- as.character(hddm_refits$X)
names(hddm_refits)[which(names(hddm_refits) == 'X')] <-'sub_id'
#For later comparison of whether fitting the DDM parameters on full or retest sample makes a big difference
test_data_full_sample_hddm <- test_data
#drop hddm columns from test_data
test_data = cbind(test_data$sub_id, test_data[,names(test_data) %in% names(hddm_refits) == FALSE])
#fix naming before merging
names(test_data)[which(names(test_data) == 'test_data$sub_id')] <-'sub_id'
#merge hddm refits to test data
test_data = merge(test_data, hddm_refits, by="sub_id")
Point estimates of reliability for the demographic variables.
process_boot_df = function(df){
df = df %>%
drop_na() %>%
mutate(dv = as.character(dv),
icc = as.numeric(as.character(icc)),
spearman = as.numeric(as.character(spearman)),
pearson = as.numeric(as.character(pearson)),
eta_sq = as.numeric(as.character(eta_sq)),
sem = as.numeric(as.character(sem)),
partial_eta_sq = as.numeric(as.character(partial_eta_sq)),
omega_sq = as.numeric(as.character(omega_sq)),
var_subs = as.numeric(as.character(var_subs)),
var_ind = as.numeric(as.character(var_ind)),
var_resid = as.numeric(as.character(var_resid)),
F_time = as.numeric(as.character(F_time)),
p_time = as.numeric(as.character(p_time)),
df_time = as.numeric(as.character(df_time)),
df_resid = as.numeric(as.character(df_resid)))
return(df)}
demog_boot_df <- read.csv(gzfile(paste0(retest_data_path,'demog_boot_merged.csv.gz')))
demog_boot_df = process_boot_df(demog_boot_df)
tmp = demog_boot_df %>%
group_by(dv) %>%
summarise(median_icc = quantile(icc, probs=0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975)) %>%
arrange(-median_icc)
tmp
sjt.df(tmp%>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/demog_rel_table.doc")
| dv | median_icc | icc_2.5 | icc_97.5 |
|---|---|---|---|
| HispanicLatino | 1 | 1 | 1 |
| Age | 0.998 | 0.998 | 0.999 |
| Sex | 0.99 | 0.989 | 1 |
| HouseholdIncome | 0.988 | 0.98 | 0.994 |
| ArrestedChargedLifeCount | 0.957 | 0.916 | 0.978 |
| WeightPounds | 0.951 | 0.914 | 0.992 |
| ChildrenNumber | 0.951 | 0.911 | 0.998 |
| HighestEducation | 0.95 | 0.921 | 0.983 |
| HowLongSmoked | 0.949 | 0.923 | 0.976 |
| RelationshipNumber | 0.947 | 0.931 | 0.962 |
| LifetimeSmoke100Cigs | 0.944 | 0.913 | 0.97 |
| CigsPerDay | 0.944 | 0.914 | 0.972 |
| AlcoholHowManyDrinksDay | 0.936 | 0.912 | 0.956 |
| HowOftenMemoryConcentrationProblemCannabis | 0.935 | 0.434 | 0.974 |
| RetirementPercentStocks | 0.929 | 0.878 | 0.97 |
| MortgageDebt | 0.929 | 0.889 | 0.964 |
| LongestRelationship | 0.928 | 0.883 | 0.963 |
| HowOftenFailedActivitiesDrinking | 0.927 | 0.811 | 0.967 |
| RentOwn | 0.926 | 0.889 | 0.963 |
| CoffeeCupsPerDay | 0.924 | 0.891 | 0.955 |
| SmokeEveryDay | 0.916 | 0.878 | 0.953 |
| HeightInches | 0.904 | 0.856 | 0.953 |
| CannabisHowOften | 0.899 | 0.83 | 0.962 |
| EducationDebt | 0.898 | 0.841 | 0.96 |
| HowOftenDevotedTimeCannabis | 0.887 | 0.643 | 1 |
| MedicalProblemsDueToDrugUse | 0.887 | 0.798 | 1 |
| DivorceCount | 0.881 | 0.791 | 1 |
| AlcoholHowOften | 0.879 | 0.824 | 0.934 |
| RelationshipStatus | 0.869 | 0.799 | 0.929 |
| HowSoonSmokeAfterWaking | 0.866 | 0.794 | 0.927 |
| RetirementAccount | 0.859 | 0.801 | 0.914 |
| AlcoholHowOften6Drinks | 0.85 | 0.789 | 0.92 |
| RelativeFriendConcernedDrinking | 0.843 | 0.756 | 0.896 |
| CannabisPast6Months | 0.84 | 0.759 | 0.905 |
| CreditCardDebt | 0.837 | 0.782 | 0.91 |
| HowOftenCantStopDrinking | 0.832 | 0.711 | 0.907 |
| TeaCupsPerDay | 0.831 | 0.752 | 0.907 |
| CaffienatedSodaCansPerDay | 0.831 | 0.765 | 0.906 |
| DoctorVisitsLastMonth | 0.83 | 0.009 | 0.908 |
| HowOftenGuiltRemorseDrinking | 0.829 | 0.706 | 0.894 |
| HowOftenDrinkMorning | 0.824 | -0.048 | 0.946 |
| CannabisHoursStoned | 0.817 | 0.648 | 0.938 |
| CannabisConsideredReduction | 0.809 | 0.545 | 0.93 |
| NeurologicalDiagnoses | 0.798 | -0.022 | 1 |
| FeelBadGuiltyDrugUse | 0.784 | 0.591 | 0.917 |
| HowOftenUnableRememberDrinking | 0.782 | 0.564 | 0.907 |
| TrafficAccidentsLifeCount | 0.782 | 0.679 | 0.853 |
| HowOftenCantStopCannabis | 0.746 | 0.143 | 1 |
| InjuredDrinking | 0.736 | 0.611 | 0.822 |
| DaysHalfLastMonth | 0.721 | 0.58 | 0.795 |
| CarDebt | 0.712 | 0.601 | 0.835 |
| DaysPhysicalHealthFeelings | 0.704 | 0.55 | 0.817 |
| Worthless | 0.703 | 0.578 | 0.806 |
| Depressed | 0.691 | 0.569 | 0.789 |
| EverythingIsEffort | 0.684 | 0.566 | 0.785 |
| Hopeless | 0.677 | 0.552 | 0.782 |
| OtherDrugs | 0.658 | 0.524 | 0.767 |
| TrafficTicketsLastYearCount | 0.632 | 0.197 | 0.837 |
| RestlessFidgety | 0.618 | 0.472 | 0.755 |
| Nervous | 0.609 | 0.465 | 0.742 |
| NeglectedFamilyDrugUse | 0.557 | -0.033 | 0.798 |
| BlackoutFlashbackDrugUse | 0.477 | -0.051 | 0.798 |
| EngagedInIllegalActsToObtainDrugs | 0.476 | -0.071 | 0.715 |
| DaysLostLastMonth | 0.473 | 0.295 | 0.747 |
| AbleToStopDrugs | 0.454 | 0.134 | 0.626 |
| WidthdrawalSymptoms | 0.412 | -0.087 | 0.65 |
| HowOftenFailedActivitiesCannabis | 0.404 | 0.124 | 0.938 |
| Last30DaysUsual | 0.398 | 0.142 | 0.59 |
| GamblingProblem | 0.322 | -0.041 | 0.715 |
| AbuseMoreThanOneDrugAtATime | 0.272 | -0.124 | 0.539 |
| HowOftenHazardousCannabis | 0.044 | -0.072 | 1 |
| CaffieneOtherSourcesDayMG | 0 | 0 | 0.832 |
| SpouseParentsComplainDrugUse | -0.03 | -0.055 | -0.021 |
| OtherDebtAmount | -0.069 | -0.498 | 0.315 |
Point estimates of reliability for demographic items
numeric_cols = c()
for(i in 1:length(names(test_demog))){
if(is.numeric(test_demog[,i])){
numeric_cols <- c(numeric_cols, names(test_demog)[i])
}
}
demog_rel_df <- data.frame(icc = rep(NA, length(numeric_cols)))
row.names(demog_rel_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
demog_rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
}
demog_rel_df = demog_rel_df %>%
mutate(var = row.names(.)) %>%
select(var, icc) %>%
arrange(-icc)
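The helper `get_icc` is defined elsewhere in the project code. As a rough, hypothetical sketch of the kind of quantity it returns, an ICC(2,1) for two time points can be computed directly from two-way ANOVA variance components (this is an illustration, not the project's actual implementation):

```r
# Hypothetical sketch of an ICC(2,1) computation for two measurement occasions,
# from the mean squares of a two-way (subjects x time) ANOVA decomposition.
icc_2_1 <- function(t1, t2) {
  ok <- complete.cases(t1, t2)       # drop subjects missing either time point
  ratings <- cbind(t1[ok], t2[ok])
  n <- nrow(ratings); k <- ncol(ratings)
  grand <- mean(ratings)
  ss_rows  <- k * sum((rowMeans(ratings) - grand)^2)  # between-subject SS
  ss_cols  <- n * sum((colMeans(ratings) - grand)^2)  # between-occasion SS
  ss_err   <- sum((ratings - grand)^2) - ss_rows - ss_cols
  ms_rows  <- ss_rows / (n - 1)
  ms_cols  <- ss_cols / (k - 1)
  ms_err   <- ss_err / ((n - 1) * (k - 1))
  (ms_rows - ms_err) /
    (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
}

icc_2_1(1:5, 1:5)  # identical scores at both time points -> 1
```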
Checking low-reliability demographic variables: mostly NA's
test_demog$HowOftenHazardousCannabis
[1] NA 0 NA 0 NA 0 NA NA NA 0 NA NA 0 NA 0 NA NA NA 0 0 0 NA 0
[24] NA 0 NA 0 0 NA 0 0 NA 0 NA 0 NA 4 0 0 0 0 0 NA 0 0 1
[47] NA 0 0 0 NA 0 NA 0 NA NA NA 0 NA 0 0 NA NA NA 0 0 0 0 NA
[70] 0 0 0 NA NA 0 NA NA 0 NA 0 0 NA NA 0 0 NA 0 NA 0 0 NA NA
[93] 0 0 NA NA 0 NA 0 NA NA NA NA NA 0 0 0 0 0 NA 0 NA 0 0 NA
[116] 0 NA 0 NA 0 NA NA NA 0 0 NA 0 0 0 0 NA 0 NA 1 NA 0 0 0
[139] NA NA 0 NA NA 0 NA 0 0 NA 0 NA
retest_demog$HowOftenHazardousCannabis
[1] 0 0 0 0 NA 0 NA NA 0 NA NA NA 0 NA NA NA 0 NA 0 0 0 NA NA
[24] NA 0 NA 0 0 NA NA 0 NA 0 0 0 NA 0 0 4 NA 0 0 NA 0 0 1
[47] NA 0 NA 0 NA NA NA NA NA 0 0 0 0 0 0 NA 0 NA 0 NA 0 0 NA
[70] 0 0 0 NA NA 0 0 0 0 NA 0 0 0 NA NA 0 NA 0 NA 0 0 NA 0
[93] 0 0 NA 0 NA NA NA NA NA NA 0 0 NA 0 0 NA NA NA 0 0 0 0 NA
[116] 0 NA NA NA 0 NA 0 NA NA NA NA NA 0 0 NA NA 0 NA 0 NA 0 NA 0
[139] 0 0 0 0 NA 0 NA 0 0 NA 0 0
test_demog$CaffieneOtherSourcesDayMG
[1] 0 0 0 160 200 120 400 0 0 0 0 0 0 0 0 0 0
[18] 0 0 0 0 0 0 0 0 0 0 0 0 35 0 0 0 250
[35] 200 40 -4 10 500 0 0 0 0 120 150 0 0 0 0 0 0
[52] 0 0 0 0 0 75 150 0 0 0 0 0 0 0 0 0 0
[69] 0 65 0 300 0 40 0 0 0 600 0 350 150 114 0 500 180
[86] 0 0 0 0 0 0 8 100 0 0 600 0 0 0 0 0 0
[103] 0 0 0 0 0 500 95 0 0 0 0 0 0 3 0 0 0
[120] 0 0 0 0 0 0 209 0 700 100 0 0 55 200 0 0 200
[137] 0 0 0 0 0 3 0 0 0 0 0 0 0 200
retest_demog$CaffieneOtherSourcesDayMG
[1] 0 0 0 0 400 0 0
[8] 200 0 0 0 210 0 0
[15] 0 0 2 0 0 0 0
[22] 0 0 0 1500 0 0 0
[29] 0 0 0 0 0 0 0
[36] 40 0 0 0 0 0 20
[43] 0 0 0 0 0 0 0
[50] 0 99999999 0 0 0 0 0
[57] 0 250 0 0 0 0 0
[64] 0 0 0 0 0 0 65
[71] 0 400 0 0 50 0 0
[78] 1200 200 400 95 60 0 250
[85] 57 0 0 30 0 0 0
[92] 0 0 10 100 1000 125 0
[99] 0 150 10 0 12 0 0
[106] 200 0 100 0 20 0 0
[113] 0 0 0 5 30 0 0
[120] 50 0 0 0 0 0 160
[127] 0 600 0 4 0 100 150
[134] 0 0 0 0 0 0 0
[141] 0 100 400 0 0 0 0
[148] 0 95 0
test_demog$SpouseParentsComplainDrugUse
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
[36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[106] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
[141] 0 0 0 0 0 0 0 0 0 0
retest_demog$SpouseParentsComplainDrugUse
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[36] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[141] 0 0 0 0 0 0 0 0 0 0
test_demog$OtherDebtAmount
[1] NA NA 1 5 NA NA NA NA 1 1 NA NA NA 2 NA NA NA NA NA NA 6 NA 1
[24] NA NA NA NA NA NA NA NA NA NA 3 1 NA NA 1 NA NA 1 1 3 1 1 1
[47] 1 1 NA NA 1 1 1 NA NA 1 1 1 NA 1 NA NA NA NA 2 1 NA 1 NA
[70] NA NA 1 NA 3 4 NA 4 NA NA 1 5 1 NA 1 1 NA 5 NA NA NA NA NA
[93] 1 NA NA 1 1 NA 6 NA NA NA NA NA NA NA NA NA 1 NA 1 8 NA NA 1
[116] NA NA NA NA NA NA NA 1 NA NA 1 NA 1 1 NA NA 1 NA NA NA 1 2 1
[139] NA 1 NA NA NA 1 NA 1 1 1 2 NA
retest_demog$OtherDebtAmount
[1] 1 1 NA 1 NA NA NA 1 NA NA NA NA 1 NA 1 NA 1 NA NA NA 1 NA 1
[24] NA 2 NA 1 NA NA 1 2 NA NA NA 1 NA NA 1 NA 1 1 1 1 NA NA 1
[47] 4 NA NA 1 NA NA 3 1 NA NA 6 NA NA 1 NA 1 NA NA NA 1 NA 2 NA
[70] NA NA 1 NA 3 4 1 1 1 2 NA 1 1 1 1 1 NA NA NA 1 NA NA 1
[93] 1 1 1 1 NA NA 1 NA 1 1 3 NA NA 1 NA NA 1 NA 1 NA NA NA 1
[116] 1 NA 1 NA NA NA 1 NA 1 NA 1 NA 1 1 1 NA 1 NA 1 NA 1 1 3
[139] NA 1 1 NA NA 5 NA 1 NA 1 NA NA
Plotting point estimates of reliability for demographic items, coloring those that have a time limit darker in the histogram.
timed_demogs <- c("DoctorVisitsLastMonth","DaysHalfLastMonth","Worthless","DaysPhysicalHealthFeelings","Depressed","EverythingIsEffort","Hopeless","RestlessFidgety","Nervous","DaysLostLastMonth","Last30DaysUsual")
demog_rel_df %>%
mutate(timed = ifelse(var %in% timed_demogs, 1, 0)) %>%
ggplot(aes(icc, fill = factor(timed)))+
geom_histogram(position = "identity", alpha = 0.7, color = NA)+
xlab("ICC")+
ylab("Count")+
scale_fill_manual(values = c("grey", "grey25"))+
theme(legend.position = "none")
ggsave('DemogRelDist.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 5, units = "in", dpi=500)
Due to our data collection strategy, we did not strictly control the delay between the two measurements, as standard psychometric studies of retest reliability might.
An individual difference measure should preferably remain stable regardless of the delay between measurements.
Since we only had two measurements, we could not directly test whether a measure becomes less reliable as the delay between the two time points grows (the average number of days between the two measurements would be the same for all measures, since reliability is a measure-level metric while days between completion is a subject-level one).
Consequently, regressing the average difference for each measure on the average delay between the two measurements amounts to regressing a vector of varying values on a vector of identical values. This is equivalent to a t-test asking whether the mean of the varying column (the average difference score) differs from the single value in the constant column (the average delay between the two time points). (This holds exactly in theory; in practice, because the dependent variable could not be computed for some subjects in some measures, there may be more than one unique value for the average delay.) This analysis is not meaningful.
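The degeneracy is easy to see in a toy example (simulated values, not project data): with a constant predictor, `lm` drops the term as rank-deficient and only the intercept, i.e. the mean of the outcome, remains.

```r
# Toy demonstration: regressing varying difference scores on a constant delay
# leaves nothing to estimate beyond the outcome's mean.
set.seed(1)
avg_difference <- rnorm(20)          # hypothetical measure-level difference scores
avg_delay <- rep(115, 20)            # the same average delay for every measure
fit <- lm(avg_difference ~ avg_delay)
coef(fit)                            # slope for avg_delay is NA (dropped);
                                     # intercept equals mean(avg_difference)
```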
Instead of a summary metric like the retest reliability estimate for each measure, we can check whether the difference score distribution for each measure depends on the delay between the two measurements. Since the difference score distribution is at the subject level, we can check whether the order of subjects in this distribution depends on their order in the distribution of days between completing the two tests.
We make a data frame with the difference between the two scores for each measure for each subject. Since the scores for different measures are on different scales, for comparability the difference scores are normalized (i.e. demeaned and divided by the standard deviation of the difference score distribution). Note that the variance of the difference score distribution reflects variance from both time points. Normalization sets the mean of each difference score distribution to 0, which would mask any meaningful change between the two time points; the analysis here, however, does not interpret the mean of the difference score distribution but its relation to the days between completion. We check whether the variables show systematic differences between the two time points later.
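As a quick numeric check on simulated values (not project data), the variance of a difference score combines the variances of both time points minus twice their covariance, reducing to a simple sum only when the two time points are uncorrelated:

```r
# Illustrative identity check: Var(t2 - t1) = Var(t1) + Var(t2) - 2*Cov(t1, t2)
set.seed(7)
t1 <- rnorm(1000)
t2 <- 0.6 * t1 + rnorm(1000)         # correlated second measurement
lhs <- var(t2 - t1)
rhs <- var(t1) + var(t2) - 2 * cov(t1, t2)
all.equal(lhs, rhs)                  # TRUE: the identity holds exactly
```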
Here we check whether the difference is larger the longer the delay.
numeric_cols = c()
for(i in 1:length(names(test_data))){
if(is.numeric(test_data[,i]) & names(test_data)[i] %in% names(retest_data)){
numeric_cols <- c(numeric_cols, names(test_data)[i])
}
}
t1_2_difference = data.frame()
for(i in 1:length(numeric_cols)){
tmp = match_t1_t2(numeric_cols[i],format='wide')
tmp = tmp %>%
mutate(difference = scale(`2` - `1`))
t1_2_difference = rbind(t1_2_difference, tmp)
}
t1_2_difference$difference = as.data.frame(t1_2_difference$difference)$V1
t1_2_difference = t1_2_difference %>% separate(dv, c("task", "dv2"), sep="\\.", remove=FALSE)
Add completion dates to this data frame.
retest_task_comp_times = read.csv(paste0(retest_data_path, 'Local/retest_task_completion_times.csv'))
test_task_comp_times = read.csv(paste0(retest_data_path, 'Local/test_task_completion_times.csv'))
task_comp_times = merge(retest_task_comp_times, test_task_comp_times, by=c('worker_id','task'))
rm(retest_task_comp_times, test_task_comp_times)
task_comp_times = task_comp_times %>%
select(-X.x, -X.y) %>%
mutate(finish_day.x = as.Date(finish_day.x),
finish_day.y = as.Date(finish_day.y),
days_btw = finish_day.x-finish_day.y) %>%
rename(sub_id=worker_id)
t1_2_difference = merge(t1_2_difference, task_comp_times[,c('sub_id', 'task','days_btw')], by=c('sub_id', 'task'))
What does the distribution of differences look like? Below we plot the distribution of differences between the two time points for each measure.
t1_2_difference %>%
ggplot(aes(difference, alpha=dv))+
geom_histogram(position='identity')+
theme(legend.position = 'none')
What do the difference score distributions look like with respect to the days between completion?
t1_2_difference %>%
ggplot()+
geom_smooth(aes(as.numeric(days_btw), abs(difference), group=factor(dv)), method='lm', se=FALSE, color = 'grey', alpha = 0.5)+
geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
theme(legend.title = element_blank())+
xlab('Days between completion')+
ylab('Scaled difference score')
ggsave('DaysBtwEffect.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 4, units = "in", limitsize = FALSE, dpi = 500)
To test whether the slope of the black line is significant, we run a mixed effects model with a fixed effect of days between completion, a random slope for days between for each dependent variable, and a random intercept for each dependent variable.
Previously we used subjects as a random effect, but the days between the two time points for each measure depend on subject ID. What varies randomly is which dependent variable's distribution of differences we are examining in relation to the days between the time points. So we changed the model to have a fixed effect of days between, a random slope for days between (dependent variables can be differentially sensitive to the effect of delay), and a random intercept for each dependent variable.
A significant fixed effect suggests that, on average, the longer the delay the smaller the difference.
summary(lmerTest::lmer(abs(difference) ~ scale(days_btw)+(scale(days_btw) | dv), data=t1_2_difference))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: abs(difference) ~ scale(days_btw) + (scale(days_btw) | dv)
Data: t1_2_difference
REML criterion at convergence: 132273
Scaled residuals:
Min 1Q Median 3Q Max
-1.127 -0.719 -0.260 0.402 15.807
Random effects:
Groups Name Variance Std.Dev. Corr
dv (Intercept) 0.003190 0.0565
scale(days_btw) 0.000149 0.0122 1.00
Residual 0.468366 0.6844
Number of obs: 63454, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.72218 0.00382 457.00000 189.05 <2e-16 ***
scale(days_btw) -0.00985 0.00278 3236.00000 -3.55 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
scl(dys_bt) 0.146
A single mixed effects model, however, gives no sense of the simple effects (for how many of the variables the effect of days between is significant, and in which direction). We can run the model separately for each dependent variable to see whether all difference score distributions are affected the same way by the days between completion. Since this entails many tests, we need to correct for multiple comparisons. How many of these tests remain significant after FDR correction? None.
get_delay_effect = function(df){
mod = lm(abs(difference) ~ scale(days_btw), data = df)
out = data.frame(estimate=coef(summary(mod))["scale(days_btw)","Estimate"], pval=coef(summary(mod))["scale(days_btw)","Pr(>|t|)"])
return(out)
}
days_effect = t1_2_difference %>%
group_by(dv) %>%
do(get_delay_effect(.))
#Correct p-values for multiple comparisons
sum(p.adjust(days_effect$pval, method = "fdr") < 0.05)
[1] 0
Conclusion: the average difference between the scores of a measure does not change with increased delay.
Some surveys have overlapping questions. Do these correlate within and across sessions?
First determine the overlapping questions.
tmp = read.csv(gzfile(paste0(retest_data_path, 'items.csv.gz')))
tmp = tmp %>%
filter(worker == 's005') %>%
select(item_ID, item_text) %>%
mutate(item_text = trimws(as.character(item_text))) %>%
unite(item, c("item_ID", "item_text"), sep = "___")
comb = as.data.frame(t(combn(unique(tmp$item),2)))
duplicate_items = comb %>%
filter(grepl('dospert', V1)==FALSE) %>%
filter(grepl('selection_optimization', V1)==FALSE) %>%
filter(grepl('sensation_seeking', V1)==FALSE) %>%
separate(V1, c("item1_ID", "item1_text"), sep="___") %>%
separate(V2, c("item2_ID", "item2_text"), sep="___") %>%
mutate(similarity = levenshteinSim(item1_text, item2_text)) %>%
filter(similarity>0.8) %>%
select(item1_ID, item2_ID, item1_text, item2_text)
duplicate_items
#surveys to read in
extract_items = c('worker',unique(with(duplicate_items, c(item1_ID, item2_ID))))
#correlations to compute:
#item1_t1 - item2_t1,
#item1_t2 - item2_t2,
#item1_t1 - item2_t2,
#item1_t2 - item2_t1
duplicate_items_data_t1 = read.csv(paste0(test_data_path, 'subject_x_items.csv'))
duplicate_items_data_t2 = read.csv(paste0(retest_data_path, 'subject_x_items.csv'))
duplicate_items_data_t1 = duplicate_items_data_t1 %>%
filter(worker %in% duplicate_items_data_t2$worker) %>%
select(extract_items)
duplicate_items_data_t2=duplicate_items_data_t2 %>%
filter(worker %in% duplicate_items_data_t1$worker) %>%
select(extract_items)
duplicate_items = duplicate_items %>%
mutate(t1_t1_cor = NA,
t2_t2_cor = NA,
t1_t2_cor = NA,
t2_t1_cor = NA,
t1_t1_polycor = NA,
t2_t2_polycor = NA,
t1_t2_polycor = NA,
t2_t1_polycor = NA)
#Pearson r is invariant to linear rescaling, so running this loop on raw and on scale()d scores gives identical correlations; a single loop suffices
for(i in 1:nrow(duplicate_items)){
duplicate_items$t1_t1_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t2_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t2_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t1_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}
duplicate_items
summary(duplicate_items$t1_t1_cor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.357 0.557 0.669 0.625 0.723 0.788
summary(duplicate_items$t2_t2_cor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.244 0.497 0.684 0.613 0.736 0.762
summary(duplicate_items$t1_t2_cor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.150 0.437 0.550 0.507 0.602 0.704
summary(duplicate_items$t2_t1_cor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.232 0.359 0.598 0.516 0.647 0.683
summary(duplicate_items$t1_t1_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.284 0.355 0.583 0.567 0.742 0.822
summary(duplicate_items$t2_t2_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.101 0.446 0.612 0.580 0.700 0.904
summary(duplicate_items$t1_t2_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0128 0.3570 0.5710 0.5160 0.6870 0.8380
summary(duplicate_items$t2_t1_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0631 0.4490 0.5350 0.5010 0.5830 0.8190
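Note that computing `cor()` on raw or `scale()`d scores makes no difference: Pearson correlation is invariant to positive linear rescaling (z-scoring included). A quick self-contained Python check of that invariance:

```python
def pearson(x, y):
    """Plain Pearson correlation, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

x = [1, 2, 4, 5, 7]
y = [2, 1, 5, 4, 8]
r_raw = pearson(x, y)
# Any positive linear transform (z-scoring included) leaves r unchanged:
r_scaled = pearson([(a - 3.8) / 2.0 for a in x], [10 * b + 3 for b in y])
print(round(r_raw, 9) == round(r_scaled, 9))  # True
```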
Panels below: time 1 item 1 vs. time 1 item 2; time 1 item 1 vs. time 2 item 2; time 2 item 1 vs. time 1 item 2; time 2 item 1 vs. time 2 item 2.
tmp = duplicate_items_data_t1 %>%
gather(key, value, -worker)
tmp2 = duplicate_items_data_t2 %>%
gather(key, value, -worker)
tmp = duplicate_items %>%
select(item1_ID, item2_ID) %>%
left_join(tmp, by = c("item1_ID" = "key")) %>%
left_join(tmp, by = c("item2_ID" = "key", "worker" = "worker")) %>%
rename(item1_time1 = value.x, item2_time1 = value.y) %>%
left_join(tmp2, by = c("item1_ID" = "key", "worker" = "worker")) %>%
left_join(tmp2, by = c("item2_ID" = "key", "worker" = "worker")) %>%
rename(item1_time2 = value.x, item2_time2 = value.y) %>%
group_by(item1_ID) %>%
mutate(item1_time1 = scale(item1_time1),
item1_time2 = scale(item1_time2),
item2_time1 = scale(item2_time1),
item2_time2 = scale(item2_time2),
item1_time1 = ifelse(item1_ID == "mpq_control_survey.13", item1_time1*-1, item1_time1),
item1_time2 = ifelse(item1_ID == "mpq_control_survey.13", item1_time2*-1, item1_time2)) %>%
ungroup()
p1 = tmp %>%
ggplot(aes(item1_time1, item2_time1, col=item1_ID))+
geom_smooth (alpha=0.3, size=0, span=0.5, method = "lm")+
stat_smooth (geom="line", alpha=1, size=1, span=0.5, method= "lm")+
theme(legend.position = "none")+
geom_abline(slope=1, intercept=0, size = 2, linetype = "dashed")+
xlab("Item 1 T1")+
ylab("Item 2 T1")
p2 = tmp %>%
ggplot(aes(item1_time1, item2_time2, col=item1_ID))+
geom_smooth (alpha=0.3, size=0, span=0.5, method = "lm")+
stat_smooth (geom="line", alpha=1, size=1, span=0.5, method= "lm")+
theme(legend.position = "none")+
geom_abline(slope=1, intercept=0, size = 2, linetype = "dashed")+
xlab("Item 1 T1")+
ylab("Item 2 T2")
p3 = tmp %>%
ggplot(aes(item1_time2, item2_time1, col=item1_ID))+
geom_smooth (alpha=0.3, size=0, span=0.5, method = "lm")+
stat_smooth (geom="line", alpha=1, size=1, span=0.5, method= "lm")+
theme(legend.position = "none")+
geom_abline(slope=1, intercept=0, size = 2, linetype = "dashed")+
xlab("Item 1 T2")+
ylab("Item 2 T1")
p4 = tmp %>%
ggplot(aes(item1_time2, item2_time2, col=item1_ID))+
geom_smooth (alpha=0.3, size=0, span=0.5, method = "lm")+
stat_smooth (geom="line", alpha=1, size=1, span=0.5, method= "lm")+
theme(legend.position = "none")+
geom_abline(slope=1, intercept=0, size = 2, linetype = "dashed")+
xlab("Item 1 T2")+
ylab("Item 2 T2")
p5 = arrangeGrob(p1, p2, p3, p4, ncol=2)
ggsave('DataCheckOverlappingItems.jpg', p5, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 8, height = 5, units = "in", limitsize = FALSE, dpi = 500)
rm(tmp, tmp2, p1, p2, p3, p4, p5)
Read in and process bootstrapped results.
boot_df <- read.csv(gzfile(paste0(retest_data_path,'bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
boot_df = process_boot_df(boot_df)
boot_df = boot_df[boot_df$dv %in% retest_report_vars,]
# Check if you have all variables bootstrapped
# retest_report_vars[which(retest_report_vars %in% boot_df$dv==FALSE)]
# Boot df contains hddm parameters fit on the full sample in the t1 data
# refits_bootstrap_merged.csv.gz contains bootstrapped reliabilities
refit_boot_df = read.csv(gzfile(paste0(retest_data_path,'refits_bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
refit_boot_df = process_boot_df(refit_boot_df)
fullfit_boot_df = boot_df[as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]
boot_df = boot_df[!as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]
boot_df = rbind(boot_df, refit_boot_df)
rm(refit_boot_df)
Summarize bootstrapped results and merge to lit review data
var_boot_df = boot_df %>%
group_by(dv) %>%
summarise(mean_icc = mean(icc),
mean_pearson = mean(pearson))
rel_comp = lit_review %>%
left_join(var_boot_df, by = 'dv')
Warning: Column `dv` joining factor and character vector, coercing into
character vector
Here’s what our data looks like: (583 data points for 171 measures)
rel_comp
Distribution of reliabilities, sample sizes and delays
rel_comp %>%
select(dv, task, retest_reliability, sample_size, days) %>%
filter(days < 3600) %>%
gather(key, value, -dv, -task) %>%
ggplot(aes(value, fill=task))+
geom_density(alpha=0.5, position='identity')+
facet_wrap(~key, scales='free')+
theme(legend.title = element_blank())
summary(rel_comp$sample_size[rel_comp$task == "survey"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 38 73 126 153 791
summary(rel_comp$sample_size[rel_comp$task == "task"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 30.0 47.0 78.6 81.0 1120.0
rel_comp %>%
group_by(task) %>%
summarise(mean_sample_size = mean(sample_size),
sem_sample_size = sem(sample_size)) %>%
ggplot(aes(task,mean_sample_size,col=task))+
geom_point(size=5)+
geom_errorbar(aes(ymin=mean_sample_size-sem_sample_size, ymax=mean_sample_size+sem_sample_size), width = 0, size=2)+
xlab("")+
ylab("Sample size")+
theme(legend.position = 'none',
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=14))
ggsave('LitReviewSampleSize.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 3, height = 5, units = "in", limitsize = FALSE, dpi = 500)
Studies in the literature that report retest reliability use smaller samples for task measures than for survey measures.
summary(lm(sample_size ~ task,rel_comp))
Call:
lm(formula = sample_size ~ task, data = rel_comp)
Residuals:
Min 1Q Median 3Q Max
-115.5 -57.6 -33.6 7.4 1043.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.50 8.47 14.94 < 2e-16 ***
tasktask -47.89 10.79 -4.44 0.000011 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 127 on 581 degrees of freedom
Multiple R-squared: 0.0328, Adjusted R-squared: 0.0311
F-statistic: 19.7 on 1 and 581 DF, p-value: 0.0000108
What predicts retest reliability in the literature? Candidate predictors: measure type (task vs. survey), sample size, and delay in days.
mod1 = lmer(retest_reliability ~ task + (1|dv), rel_comp)
mod2 = lmer(retest_reliability ~ task + sample_size + (1|dv), rel_comp)
anova(mod1, mod2)
refitting model(s) with ML (instead of REML)
mod3 = lmer(retest_reliability ~ task + sample_size + days+ (1|dv), rel_comp)
anova(mod2, mod3)
refitting model(s) with ML (instead of REML)
summary(mod2)
Linear mixed model fit by REML ['lmerMod']
Formula: retest_reliability ~ task + sample_size + (1 | dv)
Data: rel_comp
REML criterion at convergence: -377.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.190 -0.488 0.127 0.550 2.575
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.0174 0.132
Residual 0.0206 0.143
Number of obs: 583, groups: dv, 171
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.7235198 0.0213512 33.9
tasktask -0.1405121 0.0258197 -5.4
sample_size -0.0001256 0.0000543 -2.3
Correlation of Fixed Effects:
(Intr) tsktsk
tasktask -0.780
sample_size -0.319 0.117
Are the residuals of this model heteroscedastic? No: the residuals do not depend on the predictor.
tmp = data.frame(rel_comp$sample_size, resid(mod2)) %>%
rename(sample_size = rel_comp.sample_size, resid = resid.mod2.)
tmp %>%
ggplot(aes(sample_size, resid))+
geom_point()+
geom_smooth(method = "lm")
summary(lm(resid ~ sample_size, tmp))
Call:
lm(formula = resid ~ sample_size, data = tmp)
Residuals:
Min 1Q Median 3Q Max
-0.4573 -0.0700 0.0182 0.0788 0.3691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.80e-15 6.68e-03 0 1
sample_size 9.01e-18 4.15e-05 0 1
Residual standard error: 0.129 on 581 degrees of freedom
Multiple R-squared: 7.95e-29, Adjusted R-squared: -0.00172
F-statistic: 4.62e-26 on 1 and 581 DF, p-value: 1
Tasks have significantly lower reliability and reliability decreases with increasing sample size.
rel_comp %>%
ggplot(aes(sample_size, retest_reliability, color=task))+
geom_smooth(method='lm')+
# geom_point(alpha = 0.2)+
theme(legend.title = element_blank(),
axis.title.y = element_text(size=14),
axis.title.x = element_text(size=14))+
xlab("Sample Size")+
ylab("Retest reliability")+
ylim(-0.15, 1)
ggsave('LitReviewRelBySampleSize.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 8, height = 5, units = "in", limitsize = FALSE, dpi = 500)
Highlighting the difference between survey and task reliability for comparability to our results.
rel_comp %>%
ggplot(aes(task, retest_reliability, fill = task))+
geom_boxplot()+
theme(legend.position = 'none',
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=14))+
xlab("")+
ylab("Mean ICC")+
ylim(-0.4,1)
ggsave('LitTaskVsSurvey.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 3, height = 5, units = "in", limitsize = FALSE, dpi = 500)
I previously compared the effect size of measure type on reliability estimates in the literature versus in our results (restricted to the variables found in the literature), but removed this analysis because:
1. the estimates in the literature are not all the same statistic, 2. from our data I was only using ICCs, and 3. I was sampling from our bootstrapped results as many data points per measure as appear in the literature, which did not seem like the best way to make our data comparable to the literature.
Despite these problems the main result was that the effect size was much larger in our dataset than in the literature; given the problems, however, the sampling analyses below are more informative than this comparison.
We also checked whether our results diverge most from studies with smaller sample sizes. The squared difference between our mean estimate and the reliability reported in the literature decreases exponentially with sample size: the smaller the literature sample, the more its reliability estimate differs from ours. This was a weak result, however, because most studies in the literature have small samples, and these studies show both small and large deviations (the effects were not significant either).
Correlation between our mean estimates from bootstrapped samples and the literature review for task variables
n_df = rel_comp %>%
group_by(dv) %>%
tally()
lit_emp_cor = function(){
boot_comp = data.frame()
for(i in 1:length(unique(rel_comp$dv)) ){
cur_dv = unique(rel_comp$dv)[i]
n = n_df$n[n_df$dv == cur_dv]
sample_df = boot_df %>% filter(dv == cur_dv)
tmp = sample_n(sample_df, n)
boot_comp = rbind(boot_comp, tmp)
}
rm(cur_dv, n, sample_df, tmp)
#check if cbind is ok
# sum(boot_comp$dv == rel_comp$dv)
#cbinding pearson because that is the most common metric in the lit
rel_comp = cbind(rel_comp, boot_comp$pearson)
#rename new column
names(rel_comp)[which(names(rel_comp) == "boot_comp$pearson")] = "pearson"
out = data.frame(task = NA, survey = NA)
out$task = with(rel_comp %>% filter(task == "task"), cor(pearson, retest_reliability))
out$survey = with(rel_comp %>% filter(task == "survey"), cor(pearson, retest_reliability))
rel_comp = rel_comp[,-16]
return(out)
}
#lit_emp_cor_out = plyr::rdply(100, lit_emp_cor)
#write.csv(lit_emp_cor_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')
lit_emp_cor_out = read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')
summary(lit_emp_cor_out)
X .n task survey
Min. : 1.0 Min. : 1.0 Min. :0.235 Min. :0.0381
1st Qu.: 25.8 1st Qu.: 25.8 1st Qu.:0.263 1st Qu.:0.1162
Median : 50.5 Median : 50.5 Median :0.274 Median :0.1392
Mean : 50.5 Mean : 50.5 Mean :0.274 Mean :0.1381
3rd Qu.: 75.2 3rd Qu.: 75.2 3rd Qu.:0.285 3rd Qu.:0.1582
Max. :100.0 Max. :100.0 Max. :0.323 Max. :0.2478
Model comparison: predicting the reliabilities in the literature from a sample of our results versus a sample from the literature.
Procedure: sample one row per measure out of the literature review, then compare the r^2 of retest_reliability ~ sampled_reliability vs. retest_reliability ~ mean_icc,
and the coefficient of retest_reliability ~ sampled_reliability vs. retest_reliability ~ mean_icc.
comp_lit_pred <- function(df){
sample_from_dv <- function(df){
if(nrow(df)>1){
row_num = sample(1:nrow(df),1)
sample_row = df[row_num,]
df = df[-row_num,]
df$lit_predictor = sample_row$retest_reliability
}
return(df)
}
sampled_df = df %>%
group_by(dv) %>%
do(sample_from_dv(.))
if(length(unique(sampled_df$task))>1){
mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size)+task, data=sampled_df)
mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size)+task, data=sampled_df)
}
else{
mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size), data=sampled_df)
mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size), data=sampled_df)
}
out = data.frame(r2_lit = summary(mod_lit)$adj.r.squared,
r2_boot = summary(mod_boot)$adj.r.squared,
m_lit = coef(summary(mod_lit))["lit_predictor","Estimate"],
m_boot = coef(summary(mod_boot))["mean_pearson","Estimate"])
return(out)
}
comp_lit_pred_out = plyr::rdply(1000, comp_lit_pred(rel_comp))
tmp = comp_lit_pred_out %>%
select(-.n, -m_lit, -m_boot) %>%
gather(key, value) %>%
separate(key, c("stat", "sample"), sep = "_")
# tmp$stat = as.factor(tmp$stat)
# levels(tmp$stat) <- c("coefficient", expression(r^"2"))
tmp %>%
ggplot(aes(value, fill=sample))+
geom_density(alpha = 0.5, position='identity', color=NA)+
# facet_grid(.~stat, scales='free', labeller = label_parsed)+
scale_fill_manual(breaks=c("boot","lit"),
labels=c("Empirical", "Literature"),
name="Predictor",
values = c("purple", "orange"))+
xlab('Variance Explained')+
ylab('Density')+
xlim(0,1)+
ylim(0,40)+
theme(axis.title.x = element_text(size=16),
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=16),
legend.title = element_text(size=16),
legend.text = element_text(size=16))
ggsave('LitAndBoot_Noise_Ceiling.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 4, units = "in", limitsize = FALSE, dpi = 500)
with(comp_lit_pred_out, t.test(r2_lit, r2_boot, paired=T))
Paired t-test
data: r2_lit and r2_boot
t = 28, df = 1000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.02444 0.02813
sample estimates:
mean of the differences
0.02628
Are we at the noise ceiling for surveys vs. tasks? Is our data better at estimating the literature for surveys or for tasks?
comp_lit_pred_out_task = plyr::rdply(1000, comp_lit_pred(rel_comp %>% filter(task=="task")))
comp_lit_pred_out_task$task = "task"
comp_lit_pred_out_survey = plyr::rdply(1000, comp_lit_pred(rel_comp %>% filter(task=="survey")))
comp_lit_pred_out_survey$task = "survey"
comp_lit_pred_out_both = rbind(comp_lit_pred_out_task, comp_lit_pred_out_survey)
comp_lit_pred_out_both %>%
select(-.n, -m_lit, -m_boot) %>%
gather(key, value, -task) %>%
separate(key, c("stat", "sample"), sep = "_") %>%
ggplot(aes(value, fill=sample, alpha=task, color=task))+
geom_density(position='identity', size=2)+
# facet_grid(~task)+
scale_fill_manual(breaks=c("boot","lit"),
labels=c("Empirical", "Literature"),
name="Predictor",
values = c("purple", "orange"))+
scale_alpha_manual(breaks=c("survey", "task"),
values = c(0.4, 0.7),
name="Measure type")+
scale_color_discrete(name="Measure type")+
xlab('Variance Explained')+
ylab('Density')
Are the differences in the distributions significant? Yes: the literature does a better job of predicting itself for tasks than for surveys.
with(comp_lit_pred_out_both, t.test(r2_lit ~ task))
Welch Two Sample t-test
data: r2_lit by task
t = -53, df = 1800, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.07730 -0.07175
sample estimates:
mean in group survey mean in group task
0.05402 0.12855
Our data predicts the literature estimates better for surveys than it does for tasks.
with(comp_lit_pred_out_both, t.test(r2_lit-r2_boot ~ task))
Welch Two Sample t-test
data: r2_lit - r2_boot by task
t = -13, df = 1900, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02398 -0.01785
sample estimates:
mean in group survey mean in group task
0.02339 0.04430
What does it look like if we estimate the left-out value using the remaining ones for the noise-ceiling model? Procedure: sample one row of the literature, fit a model of sample size + task on the remaining data, use that model to predict the reliability estimates, and compare those predictions both to the left-out reliability estimate from the literature and to the average reliability estimate from our data.
pred_sample <- function(df){
get_sample_row <- function(df){
if(nrow(df)>1){
row_num = sample(1:nrow(df),1)
sample_row = df[row_num,]
}
else{
sample_row = df
}
return(sample_row)
}
test_df = df %>%
group_by(dv) %>%
do(get_sample_row(.))
train_df = anti_join(df, test_df)
if(length(unique(train_df$task))>1){
mod = lm(retest_reliability ~ scale(sample_size)+task, data=train_df)
}
else{
mod = lm(retest_reliability ~ scale(sample_size), data=train_df)
}
test_df$pred = predict(mod, test_df)
# out = data.frame(res_lit = sum(test_df$retest_reliability - pred)^2,
# res_boot = sum(test_df$mean_pearson - pred)^2)
return(test_df)
}
Why did the previous function have to be run separately per measure type, while this one can be run on all the data at once? Before, the output was model fits across all measures; here the output is a measure-level prediction.
pred_sample_out = plyr::rdply(1000, pred_sample(rel_comp))
Plot difference between prediction and value of either the lit estimate or the average estimate from our data
tmp = pred_sample_out %>%
select(task,pred, retest_reliability, mean_pearson) %>%
gather(key, value, -task, -pred) %>%
mutate(sq_diff = (pred - value)^2)
tmp%>%
ggplot(aes(sq_diff, fill=key))+
geom_density(position = "identity", color=NA, alpha=0.5)+
scale_fill_manual(breaks=c("mean_pearson","retest_reliability"),
labels=c("Empirical", "Literature"),
name="Predicting",
values = c("purple", "orange"))
Are the means of these distributions different? Yes: the difference is larger when predicting the left-out value from our data than when predicting the left-out value from the literature, though the difference is quite small.
with(tmp, t.test(sq_diff ~ key))
Welch Two Sample t-test
data: sq_diff by key
t = 12, df = 340000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.003036 0.004189
sample estimates:
mean in group mean_pearson mean in group retest_reliability
0.05135 0.04774
Plot difference between prediction and value broken down by measure type.
tmp%>%
ggplot(aes(sq_diff, fill=key, color=task, alpha=task))+
geom_density(position = "identity", size=2)+
scale_fill_manual(breaks=c("mean_pearson","retest_reliability"),
labels=c("Empirical", "Literature"),
name="Predicting",
values = c("purple", "orange"))+
scale_alpha_manual(breaks=c("survey", "task"),
values = c(0.4, 0.7),
name="Measure type")+
scale_color_discrete(name="Measure type")
Can we capture this interaction in a single model? Yes. The significant interaction suggests that the squared difference is larger for predicting the literature compared to predicting our data for surveys while for tasks the squared difference is larger when predicting our data compared to predicting the literature. (This fits with other results too: our results differ from the literature more for the survey measures)
aggregate(sq_diff ~ key*task,tmp, FUN=mean)
summary(lm(sq_diff ~ key*task,tmp))
Call:
lm(formula = sq_diff ~ key * task, data = tmp)
Residuals:
Min 1Q Median 3Q Max
-0.0734 -0.0564 -0.0122 0.0074 0.4720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.011573 0.000332 34.9 <2e-16 ***
keyretest_reliability 0.006410 0.000469 13.7 <2e-16 ***
tasktask 0.061839 0.000413 149.6 <2e-16 ***
keyretest_reliability:tasktask -0.015581 0.000585 -26.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0819 on 341996 degrees of freedom
Multiple R-squared: 0.093, Adjusted R-squared: 0.093
F-statistic: 1.17e+04 on 3 and 341996 DF, p-value: <2e-16
Data versus model prediction plot: the predictions are not variable, so they do not capture the variability in either the literature or our data very well.
tmp %>%
ggplot(aes(value, pred, color=task))+
geom_point()+
facet_wrap(~key)
Distributions of the sum of squared residuals: residuals are larger for tasks than for surveys (i.e., predictions are worse for task variables, possibly because their variability is higher). For surveys, predictions of the literature are better than predictions of our data, while the opposite holds for task measures.
Does this fit with the results above? We still need to work out how to think of the noise-ceiling analyses and the LOOCV results together.
pred_sample_out %>%
mutate(sq_diff_lit = (pred - retest_reliability)^2,
sq_diff_emp = (pred - mean_icc)^2) %>%
group_by(.n, task) %>%
summarise(sum_sq_diff_lit = sum(sq_diff_lit),
sum_sq_diff_emp = sum(sq_diff_emp)) %>%
# select(-'.n') %>%
gather(key, value, -task, -.n) %>%
ggplot(aes(value, fill=key, color=task, alpha=task))+
geom_density(size=1)+
scale_fill_manual(breaks=c("sum_sq_diff_emp","sum_sq_diff_lit"),
labels=c("Empirical", "Literature"),
name="Predicting",
values = c("purple", "orange"))+
scale_alpha_manual(breaks=c("survey", "task"),
values = c(0.4, 0.7),
name="Measure type")+
scale_color_discrete(name="Measure type")
Literature vs. our data on a measure basis
p1 = rel_comp %>%
filter(task == 'task') %>%
ggplot(aes(x = var, y = retest_reliability)) +
geom_boxplot(fill='#00BFC4', position="identity")+
geom_point(aes(x = var, y = mean_pearson),color="purple", fill="purple", size=3, shape=23)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80"),
legend.position = 'bottom') +
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))
p2 = rel_comp %>%
filter(task == 'survey') %>%
ggplot(aes(x = var, y = retest_reliability)) +
geom_boxplot(fill='#F8766D', position="identity")+
geom_point(aes(x = var, y = mean_pearson),color="purple", fill="purple", size=3, shape=23)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80"),
legend.position = 'bottom') +
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))
p3 <- arrangeGrob(p1, p2,nrow=1)
ggsave('LitVsEmp_Measure_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 24, height = 20, units = "in", dpi=100)
rm(p1, p2, p3)
Based on Weir (2005), ICC(3,k) does not take into account within-subject differences between the two time points (i.e. the fixed effect of time/systematic error). It is thus well approximated by Pearson's r and subject to similar criticisms. Weir (2005) suggests reporting at least this systematic-error effect size if one chooses to report ICC(3,k). Based on his conclusions I report:
- ICC(3,k): As Dave clarified, this ranges from 1 down to -1/(number of repeated measures - 1), so in our case the range is [-1, 1]. Larger values mean that a subject's two scores on a given measure are more similar to each other than they are to other people's scores; the "ICC is reflective of the ability of a test to differentiate between different individuals."
- partial \(\eta^2\) for time (\(SS_{time}/SS_{within}\)): effect size of the systematic effect of time.
- SEM (\(\sqrt{MS_{error}}\)): standard error of measurement; the smaller the better. It "quantifies the precision of individual scores on a test" and is not sample-dependent in the way the ICC is, since it does not depend on between-subject variability (at least in this formulation), but it is unit-dependent.
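The three statistics above can be computed from a simple subject × time sum-of-squares decomposition. A minimal self-contained Python sketch (not the code used in these analyses; assumes two complete sessions and the consistency form ICC(3,k) = (MS_subjects − MS_error)/MS_subjects):

```python
from math import sqrt

def retest_stats(t1, t2):
    """ICC(3,k), partial eta^2 for time, and SEM from a subject x time
    ANOVA decomposition (k = 2 sessions, complete data), following Weir (2005)."""
    n, k = len(t1), 2
    scores = list(zip(t1, t2))
    grand = sum(t1 + t2) / (n * k)
    subj_means = [(a + b) / 2 for a, b in scores]
    time_means = [sum(t1) / n, sum(t2) / n]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_within = ss_total - ss_subj
    ss_time = n * sum((m - grand) ** 2 for m in time_means)
    ss_error = ss_within - ss_time
    ms_subj = ss_subj / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return {
        "icc3k": (ms_subj - ms_error) / ms_subj,      # consistency ICC
        "partial_eta_sq_time": ss_time / ss_within,   # systematic time effect
        "sem": sqrt(ms_error),                        # std error of measurement
    }

# Highly stable scores give an ICC near 1 and a small SEM:
print(retest_stats([10, 20, 30, 40], [11, 19, 31, 41]))
```

Note that the ICC here is sample-dependent through MS_subjects, while the SEM depends only on MS_error, which is the contrast drawn in the last bullet above.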
We calculated 74 measures for surveys and 372 measures for cognitive tasks.
Though we are primarily reporting ICC’s as our metric of reliability the results don’t change depending on the metric chosen. Here we plot point estimates of three different reliability metrics against each other (ICC, Pearson, Spearman). The bootstrapped version is essentially the same but the plots are busier due to more datapoints.
numeric_cols = c()
for(i in 1:length(names(test_data))){
if(is.numeric(test_data[,i]) & names(test_data)[i] %in% names(retest_data)){
numeric_cols <- c(numeric_cols, names(test_data)[i])
}
}
#Create df of point estimate reliabilities
rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
icc = rep(NA, length(numeric_cols)),
pearson = rep(NA, length(numeric_cols)),
partial_eta_sq = rep(NA, length(numeric_cols)),
sem = rep(NA, length(numeric_cols)),
var_subs = rep(NA, length(numeric_cols)),
var_ind = rep(NA, length(numeric_cols)),
var_resid = rep(NA, length(numeric_cols)))
row.names(rel_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i])
rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i])
rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i])
rel_df[numeric_cols[i], 'partial_eta_sq'] <- get_partial_eta(numeric_cols[i])
rel_df[numeric_cols[i], 'sem'] <- get_sem(numeric_cols[i])
rel_df[numeric_cols[i], 'var_subs'] <- get_var_breakdown(numeric_cols[i])$subs
rel_df[numeric_cols[i], 'var_ind'] <- get_var_breakdown(numeric_cols[i])$ind
rel_df[numeric_cols[i], 'var_resid'] <- get_var_breakdown(numeric_cols[i])$resid
}
rel_df$dv = row.names(rel_df)
row.names(rel_df) = seq(1:nrow(rel_df))
rel_df$task = 'task'
rel_df[grep('survey', rel_df$dv), 'task'] = 'survey'
rel_df[grep('holt', rel_df$dv), 'task'] = "task"
rel_df = rel_df %>%
select(dv, task, spearman, icc, pearson, partial_eta_sq, sem, var_subs, var_ind, var_resid)
# row.names(rel_df) = NULL
table(rel_df$task)
survey task
74 372
p1 = rel_df %>%
ggplot(aes(spearman, icc, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
p2 = rel_df %>%
ggplot(aes(pearson, icc, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
p3 = rel_df %>%
ggplot(aes(pearson, spearman, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
grid.arrange(p1, p2, p3, nrow=1)
ggsave('Metric_Scatterplots.jpg', plot = grid.arrange(p1, p2, p3, nrow=1), device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 12, height = 4, units = "in", limitsize = FALSE, dpi = 100)
As the scatter plots depict, the correlations between the different reliability metrics were very high.
cor(rel_df[,c('spearman', 'icc', 'pearson')])
spearman icc pearson
spearman 1.0000 0.9324 0.9577
icc 0.9324 1.0000 0.9800
pearson 0.9577 0.9800 1.0000
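The close agreement is expected: Spearman's rho is simply Pearson's r computed on ranks, so the two can only diverge when relationships are nonlinear or outliers are present. A minimal Python sketch (ignores ties, unlike the implementations used above):

```python
def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho = Pearson's r on ranks (assumes no ties)."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))

# A monotone but nonlinear relationship: Spearman is exactly 1, Pearson slightly less.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman(x, y), pearson(x, y))
```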
Note: some variables have ICCs < 0. This occurs when \(MS_{error}\) > \(MS_{between}\); for these variables the data show no relationship between the two time points.
Summarized bootstrapped reliabilities
boot_df %>%
group_by(dv) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975)) %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
# Df wrangling for plotting
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')
tmp = tmp %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
separate(var, c("var"), sep="\\.",remove=TRUE,extra="drop") %>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
arrange(task_group, var)
tmp = tmp %>%
left_join(rel_df[,c("dv", "icc")], by = "dv") %>%
rename(icc = icc.x, point_est = icc.y)
#Manual correction
tmp = tmp %>%
mutate(task = ifelse(task_group == 'holt laury survey', "task", as.character(task))) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group))
Variable level summary of bootstrapped reliabilities.
p4 <- tmp %>%
filter(task == 'task',
raw_fit == 'raw') %>%
ggplot(aes(y = var, x = icc)) +
geom_point(color = '#00BFC4')+
geom_point(aes(y = var, x = point_est), color = "black")+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
theme(panel.spacing = unit(0.75, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180, size=36),
axis.text.y = element_text(size=20),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80")) +
xlab("")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_vline(xintercept = 0, color = "red", size = 1)
p5 <- tmp %>%
filter(task == 'survey') %>%
ggplot(aes(y = var, x = icc)) +
geom_point(color = '#F8766D')+
geom_point(aes(y = var, x = point_est), color = "black")+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
theme(panel.spacing = unit(0.75, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180, size=36),
axis.text.y = element_text(size=20),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80")) +
xlab("")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_vline(xintercept = 0, color = "red", size = 1)
p6 <- arrangeGrob(p4, p5,nrow=1)
ggsave('Bootstrap_Raw_Var_Plot.jpg', plot = p6, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 36, height = 72, units = "in", limitsize = FALSE, dpi=50)
rm(p4, p5, p6)
p4_t <- tmp %>%
filter(task == 'task') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#00BFC4')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p5_t <- tmp %>%
filter(task == 'survey') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#F8766D')+
theme_bw() +
theme(axis.text.y = element_text(size=43))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p6_t <- arrangeGrob(p4_t, p5_t,nrow=1)
ggsave('Bootstrap_Poster_Plot_t.jpg', plot = p6_t, device = "jpeg", path = "../output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 100)
Example of the overlaying procedure.
p1<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = var, y = icc)) +
geom_violin(fill='#F8766D')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p2<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc, group=dv)) +
geom_violin(fill='#F8766D', position = 'identity')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p3<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#F8766D', position = 'identity')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p4 <- arrangeGrob(p1,p2, p3,nrow=1)
ggsave('Bootstrap_Example_Plot_t.jpg', plot = p4, device = "jpeg", path = "../output/figures/", width = 41, height = 5, units = "in", limitsize = FALSE, dpi = 100)
rm(p1, p2, p3, p4)
Comparison of survey measures to cognitive task measures in the bootstrapped results. Multilevel model with random intercepts for each measure and fixed effect of survey versus cognitive measure.
boot_df = boot_df %>%
mutate(task = ifelse(grepl("survey",dv), "survey","task"),
task = ifelse(grepl("holt",dv), "task", task))
boot_df %>%
group_by(task) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975),
num_vars = n()/1000) %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
summary(lmerTest::lmer(icc ~ task + (1|dv), boot_df))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ task + (1 | dv)
Data: boot_df
REML criterion at convergence: -942508
Scaled residuals:
Min 1Q Median 3Q Max
-13.201 -0.401 0.024 0.459 7.051
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.05354 0.2314
Residual 0.00701 0.0837
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.8725 0.0269 449.0000 32.4 <2e-16 ***
tasktask -0.3903 0.0295 449.0000 -13.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
The quantitative explanation for the difference in reliability estimates between surveys and tasks, as recently detailed by Hedge et al. (2017), lies in the different sources of variance underlying these measures. Specifically, the ICC is calculated as the ratio of between-subjects variance to the total variance from all sources. Thus, measures with high between-subjects variance have high test-retest reliability. Intuitively, such measures are also better suited for individual difference analyses, as they capture the differences between the subjects in a sample.
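This relationship can be sketched on simulated data (hypothetical variable names; assumes the lme4 package is available), estimating the two variance components with a random-intercept model:

```r
# Sketch: ICC as between-subjects variance over total variance. A measure
# whose scores spread widely across subjects relative to noise is reliable.
library(lme4)
set.seed(2)
sim_icc <- function(sd_subject, sd_noise, n = 100) {
  subj <- rnorm(n, sd = sd_subject)      # stable trait per subject
  long <- data.frame(sub_id = factor(rep(1:n, 2)),
                     score = rep(subj, 2) + rnorm(2 * n, sd = sd_noise))
  fit <- lmer(score ~ 1 + (1 | sub_id), data = long)
  vc <- as.data.frame(VarCorr(fit))
  vc$vcov[1] / sum(vc$vcov)              # between-subjects / total
}
sim_icc(sd_subject = 2, sd_noise = 1)    # high between-subjects variance: ICC near 0.8
sim_icc(sd_subject = 0.5, sd_noise = 1)  # low between-subjects variance: ICC near 0.2
```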
Here we first plot the percentage of variance explained by each of the three sources for the point estimates of measure reliabilities. The plot includes only raw measures (no DDM parameters), and within each task/survey the measures are ranked by the percentage of between-subjects variability (i.e. from best to worst individual difference measure). We then test statistically whether the percentage of variance explained by these sources differs between tasks and surveys.
tmp = rel_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
mutate(dv = factor(dv, levels = dv[order(task)])) %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
arrange(task_group, var_subs_pct) %>%
mutate(rank = row_number()) %>%
arrange(task, task_group, rank) %>%
gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
ungroup()%>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
filter(task=="task",
!grepl("EZ|hddm", dv))%>%
arrange(task_group, rank)
labels = tmp %>%
distinct(dv, .keep_all=T)
p1 <- tmp %>%
ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
geom_bar(stat='identity', alpha = 0.75, color='#00BFC4')+
scale_x_discrete(breaks = labels$rank,
labels = labels$var)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey85"),
legend.position = 'bottom')+
theme(legend.title = element_blank())+
scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Variance between individuals",
"Variance between sessions",
"Error variance"),
values=c("grey65", "grey45", "grey25"))+
ylab("")+
xlab("")
tmp = rel_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
mutate(dv = factor(dv, levels = dv[order(task)])) %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
arrange(task_group, var_subs_pct) %>%
mutate(rank = row_number()) %>%
arrange(task, task_group, rank) %>%
gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
ungroup()%>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
filter(task=="survey")%>%
arrange(task_group, rank)
labels = tmp %>%
distinct(dv, .keep_all=T)
p2 <- tmp %>%
ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
geom_bar(stat='identity', alpha = 0.75)+
geom_bar(stat='identity', color='#F8766D', show.legend=FALSE)+
scale_x_discrete(breaks = labels$rank,
labels = labels$var)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey85"),
legend.position = 'bottom')+
theme(legend.title = element_blank())+
scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Variance between individuals",
"Variance between sessions",
"Error variance"),
values=c("grey65", "grey45", "grey25"))+
ylab("")+
xlab("")
mylegend<-g_legend(p2)
p3 <- arrangeGrob(arrangeGrob(p1 +theme(legend.position="none"),
p2 + theme(legend.position="none"),
nrow=1),
mylegend, nrow=2,heights=c(10, 1))
ggsave('Variance_Breakdown_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 24, height = 20, units = "in", dpi = 100)
rm(tmp, labels, p1, p2 , p3)
Comparing types of variance for survey vs. task measures: survey measures have higher between-subjects variability.
Note: This analysis includes DDM variables too.
We run separate models for the different sources of variance because an interaction model crossing variance type with task seemed too complicated.
First we find that task measures have a smaller percentage of their overall variance explained by variability between subjects compared to survey measures.
tmp = boot_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct)
summary(lmerTest::lmer(var_subs_pct~task+(1|dv),tmp%>%select(-var_ind_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_subs_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_ind_pct, -var_resid_pct)
REML criterion at convergence: 3304624
Scaled residuals:
Min 1Q Median 3Q Max
-4.316 -0.621 0.103 0.647 5.381
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 188 13.7
Residual 96 9.8
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 80.33 1.60 443.00 50.4 <2e-16 ***
tasktask -28.81 1.75 443.00 -16.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
We also find that a significantly larger percentage of task measures' variance is explained by between-session variability. Larger between-session variability suggests systematic differences between the two sessions. Such systematic effects can be due to, e.g., learning effects, as explored later.
summary(lmerTest::lmer(var_ind_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_ind_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_subs_pct, -var_resid_pct)
REML criterion at convergence: 3611691
Scaled residuals:
Min 1Q Median 3Q Max
-4.412 -0.651 -0.139 0.584 4.736
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 239 15.5
Residual 191 13.8
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 9.84 1.80 444.00 5.47 7.4e-08 ***
tasktask 14.36 1.97 444.00 7.29 1.4e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
summary(lmerTest::lmer(var_resid_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_ind_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_resid_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_subs_pct, -var_ind_pct)
REML criterion at convergence: 2777592
Scaled residuals:
Min 1Q Median 3Q Max
-6.323 -0.475 0.018 0.530 6.343
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 75.2 8.67
Residual 29.4 5.43
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 9.83 1.01 446.00 9.75 <2e-16 ***
tasktask 14.45 1.10 446.00 13.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
tmp_save = tmp%>%
gather(key, value, -dv, -task) %>%
group_by(task, key) %>%
summarise(median = median(value),
sd = sd(value)) %>%
mutate(key = ifelse(key == 'var_ind_pct', 'Between session variance', ifelse(key == 'var_subs_pct', 'Between subjects variance', ifelse(key == 'var_resid_pct', 'Residual variance',NA)))) %>%
rename(Median = median, SD = sd) %>%
arrange(task, key)
tmp_save
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/var_breakdown_summary.doc")
| task | key | Median | SD |
|---|---|---|---|
| survey | Between session variance | 5.384 | 11.353 |
| survey | Between subjects variance | 83.355 | 11.686 |
| survey | Residual variance | 8.976 | 4.325 |
| task | Between session variance | 18.077 | 22.114 |
| task | Between subjects variance | 52.548 | 17.681 |
| task | Residual variance | 22.818 | 11.012 |
Summarizing for clearer presentation. This graph uses the bootstrapped reliabilities and is therefore noisier than it would be with the point estimates alone.
tmp %>%
gather(key, value, -dv, -task) %>%
group_by(task, key) %>%
summarise(mean_pct = mean(value),
sd_pct = sd(value, na.rm=T),
n = n()) %>%
mutate(cvl = qt(0.025, n-1),
cvu = qt(0.975, n-1),
cil = mean_pct+(sd_pct*cvl)/sqrt(n),
ciu = mean_pct+(sd_pct*cvu)/sqrt(n),
sem_pct = sd_pct/sqrt(n)) %>%
ggplot(aes(factor(key, levels = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Between subject variance",
"Within subject variance",
"Error variance")), mean_pct))+
# geom_point(position=position_dodge(width = 0.25), size = 3, aes(color=task))+
geom_bar(position=position_dodge(width = 0.5), width=0.5, aes(fill=task), stat='identity', alpha=0.5)+
# geom_errorbar(aes(ymin=mean_pct-sd_pct, ymax=mean_pct+sd_pct), position=position_dodge(width = 0.25), width=0, size=2)+
geom_errorbar(aes(ymin=cil, ymax=ciu, color=task), position=position_dodge(width = 0.5), width=0.1)+
theme_bw()+
xlab('')+
ylab('Percent of total variance')+
theme(legend.title = element_blank(),
legend.text = element_text(size=14),
# legend.position = 'bottom',
axis.text = element_text(size=12),
axis.title.y = element_text(size=12))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
ylim(0,100)
ggsave('Variance_Breakdown_BarPlot.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 6, height = 5, units = "in", dpi = 300)
The type of ICC we have chosen does not take within-subject (between-session/systematic) variance into account. This is why Weir recommends checking whether there is a significant change based on time and examining the SEMs. These systematic effects could be meaningful and important to account for in some measures (e.g. task measures that show learning effects).
Had we chosen another kind of ICC that takes this source of variance into account (e.g. ICC(2,1) or ICC(2,k)), it might have suggested that tasks have even lower reliability.
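A small simulation (arbitrary parameters, not battery data) illustrates the point: with a built-in session effect, the absolute-agreement ICC(2,1) is lower than the consistency ICC(3,1), using the mean-squares formulas reviewed in Weir (2005).

```r
# Sketch: a systematic session shift (e.g. practice) lowers ICC(2,1),
# which penalizes absolute disagreement, but not ICC(3,1).
set.seed(2)
n <- 30; k <- 2
subj <- rnorm(n, sd = 2)
t1 <- subj + rnorm(n, sd = 0.5)
t2 <- subj + 1.5 + rnorm(n, sd = 0.5)    # +1.5 = learning effect
d <- data.frame(score = c(t1, t2),
                sub_id = factor(rep(1:n, k)),
                time = factor(rep(1:k, each = n)))
ms <- anova(lm(score ~ sub_id + time, data = d))$`Mean Sq`
MSR <- ms[1]; MSC <- ms[2]; MSE <- ms[3] # rows, columns (time), error
icc21 <- (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
icc31 <- (MSR - MSE) / (MSR + (k - 1) * MSE)
c(icc21 = icc21, icc31 = icc31)          # icc21 < icc31
```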
Doing a simple t-test on the difference scores alone would not be a very rigorous way of testing whether any change is meaningful, because two distributions measured with error would be compared to each other. Fortunately, there are ways to take the measurement error at both time points into account.
To check whether a measure shows systematic differences between the two time points in a meaningful number of bootstrapped samples, we can check whether the effect of time is significant in each bootstrapped sample and flag variables for which more than 5% of the bootstrapped samples show significant time effects.
Another way might be to compute confidence intervals using SEMs as described in the second half of Weir (2005) and check what percent of participants have scores that fall out of this range.
First we ask: Which variables have significant time effects in more than 5% of the bootstrapped samples?
23/74 survey measures and 133/372 task measures do.
boot_df %>%
select(dv, p_time, task) %>%
mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
group_by(dv)%>%
summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
task = unique(task))%>%
filter(pct_sig_time_effect>5) %>%
arrange(task,-pct_sig_time_effect) %>%
ungroup()%>%
group_by(task) %>%
summarise(count=n()) %>%
mutate(total_num_vars = c(74, 372),
prop = count/total_num_vars) %>%
ggplot(aes(task, prop, fill=task))+
geom_bar(stat="identity", alpha=0.5, width=0.5)+
theme(legend.position = "none",
axis.text.x = element_text(size=14),
axis.title.y = element_text(size=14))+
ylab("Proportion of measures")+
xlab("")+
ylim(0,1)
ggsave('PropTimeEffects.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 3, height = 5, units = "in", limitsize = FALSE, dpi = 300)
Are these significantly different between surveys and tasks? No.
chisq.test(matrix(data = c(23,74-23,133, 372-133), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(23, 74 - 23, 133, 372 - 133), nrow = 2)
X-squared = 0.4, df = 1, p-value = 0.5
boot_df %>%
select(dv, p_time, task, pearson) %>%
mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
group_by(dv)%>%
summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
task = unique(task),
mean_pearson = mean(pearson))%>%
filter(pct_sig_time_effect>5) %>%
arrange(task,-pct_sig_time_effect) %>%
arrange(mean_pearson)
Step 1: \(T = \bar{X} + ICC \times (X_1 - \bar{X})\), where \(\bar{X}\) is the grand mean and \(X_1\) is the subject's time 1 score.
Step 2: \(SEP = SD \times \sqrt{1 - ICC^2}\), where \(SD\) is the standard deviation of both measurements.
Here I calculate the proportion of subjects that move more than one standard error of prediction (SEP) in t2 compared to t1 for each measure.
This is somewhat counterintuitive: the larger a measure's ICC, the smaller its SEP, so very small differences between the two time points can be categorized as 'meaningful' based on a tiny SEP.
What we are interested in, however, is not necessarily whether individuals change at all between the two time points, but whether this change is systematic in one direction.
Here I calculate the proportion of people showing 'meaningful' changes in each direction. To integrate the two directions I subtracted one proportion from the other and kept the variables for which more than 5% of participants showed meaningful change in one direction over the other (so a variable with 10 participants shifting in each direction would cancel out, while a variable with 15 positive and 5 negative changes would remain).
get_ind_ci = function(dv_var){
matched = match_t1_t2(dv_var)
grand_mean = mean(matched$score)
grand_sd = sd(matched$score)
dv_icc = get_icc(dv_var)
sep = grand_sd * sqrt(1-(dv_icc^2))
matched = matched %>%
spread(time, score) %>%
rename("t1"="1", "t2"="2") %>%
mutate(true_score = grand_mean+(dv_icc*(t1-grand_mean)),
ci_up = true_score+(1.96*sep),
ci_low = true_score-(1.96*sep),
t2_above_ci = ifelse(t2>ci_up,1,0),
t2_below_ci = ifelse(t2<ci_low,1, 0))
return(matched)
}
get_prop_out_ci = function(dv_var){
get_ind_ci(dv_var) %>%
summarise(prop_above_ci = sum(t2_above_ci)/n(),
prop_below_ci = sum(t2_below_ci)/n())
}
#Create df of point estimate reliabilities
ind_ci_df <- data.frame(prop_above_ci = rep(NA, length(numeric_cols)),
prop_below_ci = rep(NA, length(numeric_cols)))
row.names(ind_ci_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
ind_ci_df[numeric_cols[i], 'prop_above_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_above_ci
ind_ci_df[numeric_cols[i], 'prop_below_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_below_ci
}
ind_ci_df$dv = row.names(ind_ci_df)
row.names(ind_ci_df) = seq_len(nrow(ind_ci_df))
ind_ci_df$task = 'task'
ind_ci_df[grep('survey', ind_ci_df$dv), 'task'] = 'survey'
ind_ci_df[grep('holt', ind_ci_df$dv), 'task'] = "task"
ind_ci_df = ind_ci_df %>%
select(dv, task, prop_above_ci, prop_below_ci)
mean_diff_df = ind_ci_df %>%
mutate(prop_one_direction = prop_above_ci-prop_below_ci) %>%
filter(abs(prop_one_direction)>0.05) %>%
arrange(-abs(prop_one_direction))
mean_diff_df
This alone does not tell us whether a shift reflects a performance improvement or decline; in fact 'performance' isn't necessarily even the correct term here. Using the valence assignments of the measures, we instead examine whether a shift translates to an increase or decrease in self-regulation. Of the 66 variables, this makes sense for 54.
valence_df = read.csv(paste0(retest_data_path, 'DV_valence.csv'), header = F)
valence_df = valence_df %>% rename(dv = V1, valence = V2)
valence_df$task = 'task'
valence_df[grep('survey', valence_df$dv), 'task'] = 'survey'
valence_df[grep('holt', valence_df$dv), 'task'] = "task"
valence_df = valence_df %>% filter(dv %in% retest_report_vars)
with(valence_df, table(task, valence))
valence
task -1 1
survey 15 59
task 103 248
#Comparable level of valences for task and survey measures
chisq.test(with(valence_df, table(task, valence)))
Pearson's Chi-squared test with Yates' continuity correction
data: with(valence_df, table(task, valence))
X-squared = 2.1, df = 1, p-value = 0.1
mean_diff_df %>%
left_join(valence_df, by = c("dv", "task")) %>%
mutate(sc_increase = ifelse(prop_one_direction>0&valence==1,1,ifelse(prop_one_direction<0&valence==-1,1,0)),
sc_decrease = ifelse(prop_one_direction>0&valence==-1,1,ifelse(prop_one_direction<0&valence==1,1,0))) %>%
group_by(task) %>%
summarise(sum_sc_increase = sum(sc_increase,na.rm=T),
sum_sc_decrease = sum(sc_decrease,na.rm=T)) %>%
gather(key, value, -task) %>%
mutate(total_vars = c(74, 372, 74, 372),
prop = value/total_vars) %>%
ggplot(aes(key, prop, group=task, fill=task))+
geom_bar(stat = "identity", position = position_dodge(), alpha = 0.5)+
theme(legend.title = element_blank(),
axis.title.y = element_text(size=14),
axis.text.x = element_text(size=14))+
xlab("")+
ylab("Proportion of measures")+
ylim(0, 1)+
scale_x_discrete(breaks = c("sum_sc_decrease", "sum_sc_increase"),
labels = c("Self-control decrease", "Self-control increase"))
ggsave('PropTimeEffectsValence.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 7, height = 5, units = "in", limitsize = FALSE, dpi = 300)
Whether variables show more than a 5% difference in one direction does not depend on whether the change translates to a self-control increase or decrease.
chisq.test(matrix(data = c(23,54-23,31,54-31), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(23, 54 - 23, 31, 54 - 31), nrow = 2)
X-squared = 1.8, df = 1, p-value = 0.2
Do they differ depending on whether they are task or survey variables? No.
chisq.test(matrix(data = c(5,74-5,60,372-60), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(5, 74 - 5, 60, 372 - 60), nrow = 2)
X-squared = 3.6, df = 1, p-value = 0.06
rm(mean_diff_df)
Here we summarize the results at the task level to make them more digestible and easier to connect with the literature.
We reduce the list of task measures to one measure per task by averaging only the measures that are extracted from these tasks and used in the literature. We call these the 'meaningful measures.'
meaningful_vars = read.table('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/meaningful_vars.txt')
meaningful_vars = as.character(meaningful_vars$V1)
tmp = as.character(unique(lit_review$dv)[which(unique(lit_review$dv) %in% meaningful_vars == FALSE)])
tmp = tmp[-grep('survey', tmp)]
meaningful_vars = c(meaningful_vars, tmp)
meaningful_vars = sort(meaningful_vars)
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task',
dv %in% meaningful_vars) %>%
left_join(boot_df[,c("dv", "icc")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
group_by(task_name) %>%
summarise(median_icc = median(icc),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc)
tmp %>%
datatable() %>%
formatRound(columns=c('median_icc','icc_2.5','icc_97.5'), digits=3)
tmp = tmp%>%
mutate(task_name = gsub("_", " ", task_name),
task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE))
names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)
sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/task_rel_table.doc")
| Task Name | Median Icc | Icc 2.5 | Icc 97.5 | Num Measures | Mean Num Trials |
|---|---|---|---|---|---|
| Ravens | 0.875 | 0.845 | 0.904 | 1 | 18 |
| Adaptive N Back | 0.842 | 0.362 | 0.911 | 3 | 380 |
| Cognitive Reflection Survey | 0.826 | 0.773 | 0.876 | 1 | 6 |
| Kirby | 0.824 | 0.738 | 0.895 | 8 | 14 |
| Psychological Refractory Period Two Choices | 0.777 | 0.677 | 0.857 | 2 | 140 |
| Simple Reaction Time | 0.738 | 0.668 | 0.826 | 1 | 149 |
| Choice Reaction Time | 0.727 | 0.506 | 0.848 | 2 | 150 |
| Stroop | 0.726 | 0.432 | 0.841 | 9 | 63 |
| Hierarchical Rule | 0.709 | 0.617 | 0.786 | 1 | 360 |
| Stim Selective Stop Signal | 0.693 | 0.542 | 0.768 | 1 | 240 |
| Digit Span | 0.675 | 0.544 | 0.786 | 2 | 10 |
| Information Sampling Task | 0.67 | 0.118 | 0.859 | 8 | 10 |
| Dietary Decision | 0.67 | 0.556 | 0.774 | 1 | 48 |
| Spatial Span | 0.658 | 0.382 | 0.794 | 2 | 10 |
| Stop Signal | 0.654 | 0.255 | 0.817 | 10 | 304 |
| Simon | 0.648 | 0.289 | 0.817 | 9 | 54 |
| Discount Titrate | 0.645 | 0.496 | 0.753 | 1 | 36 |
| Keep Track | 0.641 | 0.497 | 0.719 | 1 | 9 |
| Tower Of London | 0.638 | 0.423 | 0.858 | 3 | 12 |
| Go Nogo | 0.632 | 0.256 | 0.785 | 5 | 175 |
| Columbia Card Task Hot | 0.597 | 0.433 | 0.857 | 5 | 24 |
| Attention Network Task | 0.589 | -0.433 | 0.804 | 10 | 49 |
| Motor Selective Stop Signal | 0.58 | 0.38 | 0.756 | 1 | 90 |
| Recent Probes | 0.539 | -0.369 | 0.84 | 4 | 42 |
| Columbia Card Task Cold | 0.524 | 0.248 | 0.688 | 5 | 24 |
| Two Stage Decision | 0.506 | 0.251 | 0.688 | 2 | 198 |
| Shift Task | 0.469 | -0.095 | 0.662 | 9 | 401 |
| Local Global Letter | 0.452 | -0.112 | 0.704 | 21 | 32 |
| Angling Risk Task Always Sunny | 0.448 | 0.193 | 0.629 | 2 | 21 |
| Dot Pattern Expectancy | 0.442 | -0.086 | 0.776 | 13 | 33 |
| Probabilistic Selection | 0.338 | -0.207 | 0.751 | 6 | 48 |
| Shape Matching | 0.246 | 0.071 | 0.398 | 1 | 139 |
| Directed Forgetting | 0.229 | -0.068 | 0.411 | 2 | 16 |
| Threebytwo | 0.162 | -0.225 | 0.457 | 2 | 29 |
| Holt Laury Survey | 0.15 | -0.219 | 0.386 | 1 | 10 |
| Bickel Titrator | 0.066 | -0.033 | 0.783 | 4 | 30 |
Does the number of trials in a task have a significant effect on the ICCs of its meaningful measures? No.
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task',
dv %in% meaningful_vars) %>%
left_join(boot_df[,c("dv", "icc")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
Data: tmp
REML criterion at convergence: -340247
Scaled residuals:
Min 1Q Median 3Q Max
-13.380 -0.450 0.042 0.513 6.814
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.06027 0.2455
Residual 0.00683 0.0826
Number of obs: 159000, groups: dv, 159
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 5.09e-01 2.42e-02 1.59e+02 21.04 <2e-16 ***
num_all_trials 1.85e-04 1.57e-04 1.59e+02 1.18 0.24
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
num_ll_trls -0.594
tmp %>%
ggplot(aes(num_all_trials, icc))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")
The above analysis looked at the effect of the number of trials across tasks. But some tasks might be bad for individual difference measurement regardless of how many trials they contain, whereas for others fewer trials might yield a sufficiently reliable measure.
For tasks whose dependent variables are estimated using many trials one can ask: does the same measure become less reliable if fewer trials are used to estimate it?
This won’t make sense for all tasks. For example, estimating a risk aversion parameter from the Holt and Laury task requires all trials. The Kirby and Bickel tasks have specific conditions that already use subsets of trials. The Cognitive Reflection Task might be more appropriately analyzed item by item, and the writing task has no trial structure. For all other tasks this question is worth investigating.
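As a sketch of what such a trial-subsampling analysis involves, a hypothetical helper (not the actual pipeline code) could recompute a simple dependent variable, here mean RT, from an increasing number of trials:

```r
# Hypothetical sketch (not the exact pipeline code): recompute a simple
# DV, here mean RT, using only the first k trials for increasing k.
dv_by_trial_count <- function(rt, step = 10) {
  ks <- seq(step, length(rt), by = step)
  data.frame(n_trials = ks,
             mean_rt = sapply(ks, function(k) mean(rt[1:k])))
}
```

For each subject and each session one would compute such truncated DVs and then estimate their retest reliability at every trial count, which is the structure of the example below.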
These kinds of analyses are too task-specific and in-depth for a paper that aims to give a global sense of the differences between self-regulation measures in their suitability for individual difference analyses based on their stability across time. Such analyses would amount to a detailed examination of how to extract the most reliable individual difference measure from tasks whose variables are mediocre to begin with. Though we do not provide a comprehensive analysis of this sort in this paper, we provide a single example of the approach and hope that the open access we provide to the data spurs further work.
For this example we look at the retest reliability of dependent measures from the threebytwo task, which has more than 400 trials. Here is a graph of how the point estimates of retest reliability change for each dependent measure when different numbers of trials are used to estimate them.
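The `get_icc` helper used below is defined earlier in the analysis; for the reader, a minimal stand-alone sketch of an ICC(2,1) computation (two-way random effects, absolute agreement) between two sessions might look like the following. This is an illustrative approximation, not necessarily the exact implementation used here.

```r
# Illustrative ICC(2,1) between two sessions for one DV; assumes the
# t1/t2 data frames are aligned on the same subjects in the same order.
get_icc_sketch <- function(dv, t1, t2) {
  x <- cbind(t1[[dv]], t2[[dv]])
  x <- x[stats::complete.cases(x), , drop = FALSE]
  n <- nrow(x); k <- ncol(x)
  grand <- mean(x)
  row_means <- rowMeans(x); col_means <- colMeans(x)
  MSR <- k * sum((row_means - grand)^2) / (n - 1)   # between-subjects MS
  MSC <- n * sum((col_means - grand)^2) / (k - 1)   # between-sessions MS
  resid <- x - outer(row_means, rep(1, k)) - outer(rep(1, n), col_means) + grand
  MSE <- sum(resid^2) / ((n - 1) * (k - 1))
  (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
}
```

With identical scores at both sessions this returns 1; session effects and residual noise both pull it down, which is why ICC(2,1) is stricter than a simple Pearson correlation.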
Post process dv’s from cluster to calculate reliabilities
t1_dvs = read.csv(paste0(retest_data_path, 't1_tbt_dvs.csv'))
t2_dvs = read.csv(paste0(retest_data_path, 't2_tbt_dvs.csv'))
t1_dvs = t1_dvs %>% select(-X)
t2_dvs = t2_dvs %>% select(-X)
hr_merge = merge(t1_dvs, t2_dvs, by = c("sub_id", "breaks"))
hr_merge = hr_merge %>%
gather(key, value, -sub_id, -breaks) %>%
separate(key, c("dv", "time"), sep="\\.") %>%
mutate(time = ifelse(time == "x", 1, 2))
t1_dvs = hr_merge %>%
filter(time == 1) %>%
select(-time) %>%
spread(dv, value)
t2_dvs = hr_merge %>%
filter(time == 2) %>%
select(-time) %>%
spread(dv, value)
# calculate point estimates for reliability of each of the variables for each break
# get_icc for each break of tmp_t1_dvs and tmp_t2_dvs
trial_num_rel_df = data.frame(breaks=rep(NA, length(unique(t1_dvs$breaks))),
acc_icc=rep(NA, length(unique(t1_dvs$breaks))),
avg_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
std_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
avg_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
std_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
missed_percent_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))))
for(i in 1:length(unique(t1_dvs$breaks))){
cur_break = unique(t1_dvs$breaks)[i]
tmp_t1_dvs = t1_dvs %>% filter(breaks == cur_break)
tmp_t2_dvs = t2_dvs %>% filter(breaks == cur_break)
trial_num_rel_df$breaks[i] = cur_break
trial_num_rel_df$acc_icc[i] = get_icc("acc", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$avg_rt_error_icc[i] = get_icc("avg_rt_error", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$std_rt_error_icc[i] = get_icc("std_rt_error", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$avg_rt_icc[i] = get_icc("avg_rt", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$std_rt_icc[i] = get_icc("std_rt", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$missed_percent_icc[i] = get_icc("missed_percent", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_rt_100_icc[i] = get_icc("cue_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_rt_900_icc[i] = get_icc("cue_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_rt_100_icc[i] = get_icc("task_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_rt_900_icc[i] = get_icc("task_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_acc_100_icc[i] = get_icc("cue_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_acc_900_icc[i] = get_icc("cue_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_acc_100_icc[i] = get_icc("task_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_acc_900_icc[i] = get_icc("task_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
}
rm(i, cur_break, tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$breaks = as.numeric(trial_num_rel_df$breaks)
# write.csv(trial_num_rel_df, paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
Takeaways from this graph:

- Reliability estimates stabilize after about an eighth of the trials (note that this might not hold for other model parameter estimates; here we only look at raw response time and accuracy measures).
- Only three of the non-contrast measures have reliabilities above 0.
- All other measures are essentially completely unreliable (including the two ‘meaningful variables’ in the literature: the cue switch cost RTs).
trial_num_rel_df = read.csv(paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
# colors for the measures kept in the plot (the SD RT measures are dropped below)
cols <- c("Accuracy" = '#084594', "Average RT error" = '#99000d', "Average RT correct" = '#ef3b2c', "Missed percentage" = '#2171b5', "Cue switch cost RT (100)" = '#fc9272', "Cue switch cost RT (900)" = '#fcbba1', "Task switch cost RT (100)" = '#fee0d2', "Task switch cost RT (900)" = '#fff5f0', "Cue switch cost Acc (100)" = '#4292c6', "Cue switch cost Acc (900)" = '#6baed6', "Task switch cost Acc (100)" = '#9ecae1', "Task switch cost Acc (900)" = '#c6dbef')
trial_num_rel_df %>%
select(-std_rt_icc, -std_rt_error_icc)%>%
gather(key, value, -breaks) %>%
ggplot(aes(breaks*10, value, color=factor(key, levels = c("acc_icc", "avg_rt_error_icc", "avg_rt_icc", "missed_percent_icc", "cue_switch_cost_rt_100_icc", "cue_switch_cost_rt_900_icc", "task_switch_cost_rt_100_icc", "task_switch_cost_rt_900_icc", "cue_switch_cost_acc_100_icc", "cue_switch_cost_acc_900_icc", "task_switch_cost_acc_100_icc", "task_switch_cost_acc_900_icc"), labels = c("Accuracy", "Average RT error", "Average RT correct", "Missed percentage", "Cue switch cost RT (100)", "Cue switch cost RT (900)", "Task switch cost RT (100)", "Task switch cost RT (900)", "Cue switch cost Acc (100)", "Cue switch cost Acc (900)", "Task switch cost Acc (100)", "Task switch cost Acc (900)"))))+
geom_point()+
geom_line()+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")+
theme(legend.title = element_blank(),
legend.position = 'bottom')+
scale_color_manual(values = cols,
breaks = c("Average RT correct", "Average RT error", "Cue switch cost RT (100)","Cue switch cost RT (900)", "Task switch cost RT (100)","Task switch cost RT (900)","Accuracy", "Missed percentage", "Cue switch cost Acc (100)", "Cue switch cost Acc (900)", "Task switch cost Acc (100)", "Task switch cost Acc (900)"))+
ylim(-1,1)+
guides(color = guide_legend(ncol=2, byrow=F))
ggsave('Intrameasure_Trialnum_Dependendence.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 10, height = 5, units = "in", dpi = 100)
Checking DDM results in the bootstrapped estimates: variables computed using all trials (non-contrast measures) are significantly more reliable than difference (contrast) scores; raw measures do not differ from DDM parameters; and which DDM fitting procedure is more reliable depends on whether all trials are used or a contrast is taken.
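For reference, the EZ-diffusion procedure (Wagenmakers, van der Maas, and Grasman, 2007) recovers drift rate, boundary separation, and non-decision time in closed form from accuracy and the mean and variance of correct RTs. A compact sketch with the conventional scaling s = 0.1 (illustrative, not necessarily identical to the fitting code used in our pipeline):

```r
# EZ-diffusion: closed-form drift rate (v), boundary separation (a), and
# non-decision time (Ter) from proportion correct (pc) and the variance
# (vrt) and mean (mrt) of correct RTs, in seconds.
ez_diffusion <- function(pc, vrt, mrt, s = 0.1) {
  stopifnot(pc > 0, pc < 1, pc != 0.5)     # edge cases need corrections
  L <- qlogis(pc)                          # logit of accuracy
  x <- L * (L * pc^2 - L * pc + pc - 0.5) / vrt
  v <- sign(pc - 0.5) * s * x^(1/4)        # drift rate
  a <- s^2 * L / v                         # boundary separation
  y <- -v * a / s^2
  mdt <- (a / (2 * v)) * (1 - exp(y)) / (1 + exp(y))  # mean decision time
  c(drift = v, threshold = a, non_decision = mrt - mdt)
}
```

With the worked example from that paper (pc = 0.802, vrt = 0.112, mrt = 0.723) this returns approximately v = 0.1, a = 0.14, and Ter = 0.30.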
tmp = measure_labels %>%
mutate(dv = as.character(dv),
contrast = ifelse(overall_difference == "difference", "contrast", "non-contrast")) %>%
filter(ddm_task == 1,
rt_acc != 'other') %>%
drop_na() %>%
left_join(boot_df[,c("dv", "icc")], by = 'dv')
tmp_save = tmp %>%
drop_na() %>% #try removing this in final release
group_by(contrast, raw_fit, rt_acc) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
num_vars = n()/1000)
tmp_save %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5'), digits=3)
tmp_save = tmp_save %>%
ungroup() %>%
mutate(contrast = as.character(contrast),
raw_fit = as.character(raw_fit),
rt_acc = as.character(rt_acc)) %>%
arrange(-icc_median)
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/ddm_rel_table.doc")
| contrast | raw_fit | rt_acc | icc_median | icc_2.5 | icc_97.5 | num_vars |
|---|---|---|---|---|---|---|
| non-contrast | EZ | drift rate | 0.724 | 0.528 | 0.826 | 14 |
| non-contrast | raw | rt | 0.7 | 0.378 | 0.838 | 41 |
| non-contrast | hddm | drift rate | 0.684 | 0.387 | 0.816 | 16 |
| non-contrast | EZ | non-decision | 0.647 | 0.197 | 0.794 | 14 |
| non-contrast | EZ | threshold | 0.624 | 0.384 | 0.823 | 14 |
| non-contrast | raw | accuracy | 0.583 | -0.267 | 0.825 | 32 |
| non-contrast | hddm | threshold | 0.544 | 0.316 | 0.668 | 14 |
| non-contrast | hddm | non-decision | 0.449 | -0.032 | 0.671 | 14 |
| contrast | hddm | drift rate | 0.404 | -0.242 | 0.651 | 18 |
| contrast | raw | rt | 0.299 | -0.193 | 0.63 | 25 |
| contrast | raw | accuracy | 0.27 | -0.245 | 0.698 | 13 |
| contrast | EZ | drift rate | 0.248 | -0.196 | 0.579 | 17 |
| contrast | EZ | non-decision | 0.236 | -0.204 | 0.652 | 17 |
| contrast | EZ | threshold | 0.097 | -0.449 | 0.537 | 17 |
| contrast | hddm | threshold | 0.09 | -0.201 | 0.345 | 1 |
Comparing contrast vs. non-contrast measures: non-contrast (overall) measures have significantly higher reliability than contrast (difference) scores.
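This pattern has a classical psychometric explanation: the reliability of a difference score shrinks as its two components correlate more strongly. A quick numeric illustration with assumed, not empirical, values:

```r
# Classical test theory: reliability of a difference score X - Y,
# assuming equal variances for X and Y.
#   r_D = (r_xx + r_yy - 2 * r_xy) / (2 - 2 * r_xy)
diff_score_reliability <- function(r_xx, r_yy, r_xy) {
  (r_xx + r_yy - 2 * r_xy) / (2 - 2 * r_xy)
}
```

For example, two condition means each with reliability 0.8 but correlated at 0.6 yield a difference score with reliability of only 0.5, even though each component is quite stable on its own.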
summary(lmerTest::lmer(icc ~ contrast + (1|dv) ,tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ contrast + (1 | dv)
Data: tmp
REML criterion at convergence: -497527
Scaled residuals:
Min 1Q Median 3Q Max
-11.647 -0.472 0.041 0.525 6.220
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.03487 0.1867
Residual 0.00901 0.0949
Number of obs: 267000, groups: dv, 267
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.2451 0.0180 266.2000 13.6 <2e-16 ***
contrastnon-contrast 0.3557 0.0233 266.2000 15.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
cntrstnn-cn -0.772
Comparing raw vs. DDM measures among non-contrast estimates: EZ parameters are significantly more reliable than HDDM parameters and comparable to raw measures.
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(contrast == "non-contrast")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
Data: tmp %>% filter(contrast == "non-contrast")
REML criterion at convergence: -379104
Scaled residuals:
Min 1Q Median 3Q Max
-11.269 -0.496 0.034 0.529 8.073
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.02700 0.1643
Residual 0.00535 0.0731
Number of obs: 159000, groups: dv, 159
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.6493 0.0254 156.8000 25.61 <2e-16 ***
raw_fithddm -0.1068 0.0355 156.8000 -3.01 0.003 **
raw_fitraw -0.0413 0.0318 156.8000 -1.30 0.196
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) rw_fth
raw_fithddm -0.715
raw_fitraw -0.797 0.570
Comparing raw vs. DDM measures among contrast (difference) scores: EZ parameters are significantly less reliable than HDDM parameters and comparable to raw measures.
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(contrast == "contrast")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
Data: tmp %>% filter(contrast == "contrast")
REML criterion at convergence: -150642
Scaled residuals:
Min 1Q Median 3Q Max
-9.214 -0.535 0.065 0.615 4.172
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.0423 0.206
Residual 0.0144 0.120
Number of obs: 108000, groups: dv, 108
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.1900 0.0288 105.1000 6.59 1.8e-09 ***
raw_fithddm 0.1425 0.0553 105.1000 2.58 0.011 *
raw_fitraw 0.0855 0.0441 105.1000 1.94 0.055 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) rw_fth
raw_fithddm -0.521
raw_fitraw -0.653 0.340
tmp %>%
ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
geom_boxplot()+
facet_wrap(~factor(contrast, levels=c("non-contrast", "contrast"), labels=c("Non-contrast", "Contrast")))+
theme_bw()+
ylab("ICC")+
xlab("")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
legend.text = element_text(size=16),
strip.text = element_text(size=16),
axis.text = element_text(size = 16),
text = element_text(size=16))+
guides(fill = guide_legend(ncol = 2))+
scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))
ggsave('Bootstrap_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
tmp2 = measure_labels %>%
mutate(dv = as.character(dv),
contrast = ifelse(overall_difference == "difference", "contrast", "non-contrast")) %>%
filter(ddm_task == 1,
rt_acc != 'other') %>%
drop_na() %>%
left_join(rel_df[,c("dv", "icc")], by = 'dv')
tmp2 %>%
ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
geom_boxplot()+
facet_wrap(~factor(contrast, levels=c("non-contrast", "contrast"), labels=c("Non-contrast", "Contrast")))+
theme_bw()+
ylab("ICC")+
xlab("")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
legend.text = element_text(size=16),
strip.text = element_text(size=16),
axis.text = element_text(size = 16),
text = element_text(size=16))+
guides(fill = guide_legend(ncol = 2))+
scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))+
ylim(-0.4, 1)
ggsave('PointEst_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 15, height = 7, units = "in", limitsize = FALSE)
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'survey') %>%
left_join(boot_df[,c("dv", "icc")], by = 'dv') %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
tmp_save = tmp %>%
group_by(task_name) %>%
summarise(median_icc = median(icc),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc)
tmp_save %>%
datatable() %>%
formatRound(columns=c('median_icc', 'icc_2.5','icc_97.5'), digits=3)
tmp_save = tmp_save%>%
mutate(task_name = gsub("_", " ", task_name),
task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE),
task_name = gsub("Survey", "", task_name))
names(tmp_save) = gsub("_", " ", names(tmp_save))
names(tmp_save) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp_save), perl=TRUE)
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/survey_rel_table.doc")
| Task Name | Median ICC | ICC 2.5 | ICC 97.5 | Num Measures | Mean Num Items |
|---|---|---|---|---|---|
| Self Regulation | 0.95 | 0.936 | 0.961 | 1 | 31 |
| Grit Scale | 0.942 | 0.924 | 0.958 | 1 | 8 |
| Brief Self Control | 0.941 | 0.92 | 0.966 | 1 | 13 |
| Ten Item Personality | 0.936 | 0.852 | 0.966 | 5 | 2 |
| Time Perspective | 0.92 | 0.865 | 0.954 | 5 | 11 |
| Five Facet Mindfulness | 0.909 | 0.828 | 0.965 | 6 | 13 |
| Bis Bas | 0.904 | 0.851 | 0.958 | 5 | 7 |
| Dospert Rt | 0.903 | 0.855 | 0.953 | 5 | 6 |
| Impulsive Venture | 0.9 | 0.815 | 0.95 | 2 | 17 |
| Sensation Seeking | 0.899 | 0.852 | 0.951 | 5 | 16 |
| Mindful Attention Awareness | 0.897 | 0.862 | 0.928 | 1 | 15 |
| Erq | 0.896 | 0.827 | 0.933 | 2 | 5 |
| Upps Impulsivity | 0.886 | 0.718 | 0.959 | 5 | 12 |
| Mpq Control | 0.879 | 0.796 | 0.95 | 1 | 24 |
| Eating | 0.872 | 0.813 | 0.914 | 4 | 9 |
| Bis11 | 0.864 | 0.63 | 0.939 | 4 | 15 |
| Dickman | 0.85 | 0.721 | 0.91 | 2 | 12 |
| Future Time Perspective | 0.844 | 0.792 | 0.903 | 1 | 10 |
| Dospert Rp | 0.842 | 0.761 | 0.901 | 5 | 6 |
| Dospert Eb | 0.8 | 0.711 | 0.868 | 5 | 6 |
| Selection Optimization Compensation | 0.799 | 0.598 | 0.898 | 4 | 12 |
| Leisure Time Activity | 0.798 | 0.731 | 0.857 | 1 | 1 |
| Theories Of Willpower | 0.794 | 0.725 | 0.86 | 1 | 12 |
Does the number of items in a survey have a significant effect on the average ICC of its measures? No.
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
Data: tmp
REML criterion at convergence: -320933
Scaled residuals:
Min 1Q Median 3Q Max
-10.265 -0.516 -0.004 0.540 6.540
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.003327 0.0577
Residual 0.000673 0.0259
Number of obs: 72000, groups: dv, 72
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.864206 0.011510 70.400000 75.08 <2e-16 ***
num_all_trials 0.001071 0.000915 70.400000 1.17 0.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
num_ll_trls -0.807
tmp %>%
ggplot(aes(num_all_trials, icc))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of items")+
ylab("ICC")